The tremendous growth and vitality of psychology and its increasing fusion with
the social and biological sciences demand a new approach to teaching at the 
introductory level. The basic course, geared as it usually is to a single text 
that tries to skim everything—that sacrifices depth for superficial breadth 
—is no longer adequate. Psychology has become too diverse for any one 
man, or a few men, to write about with complete authority. The alterna- 
tive, a book that ignores many essential areas in order to present more 
comprehensively and effectively a particular aspect or view of psychology, 
is also insufficient. For in this solution, many key areas are simply not 
communicated to the student at all. 

The Foundations of Modern Psychology is a new and different ap- 
proach to the introductory course. The instructor is offered a series of 
short volumes, each a self-contained book on the special issues, methods, 
and content of a basic topic by a noted authority who is actively con- 
tributing to that particular field. And taken together, the volumes cover 
the full scope of psychological thought, research, and application. 

The result is a series that offers the advantage of tremendous flexibility 
and scope. The teacher can choose the subjects he wants to emphasize and 
present them in the order he desires. And without necessarily sacrificing 
breadth, he can provide the student with a much fuller treatment of 
individual areas at the introductory level than is normally possible. If 
he does not have time to include all the volumes in his course, he can 
recommend the omitted ones as outside reading, thus covering the full 
range of psychological topics. 

Psychologists are becoming increasingly aware of the importance of 
reaching the introductory student with high-quality, well-written, and
stimulating material, material that highlights the continuing and exciting 
search for new knowledge. The Foundations of Modern Psychology Series 
is our attempt to place in the hands of instructors the best textbook tools
for this purpose.


Preface 


An elementary statistics teacher is often reminded of the popular record album
“Classical Music for People Who Hate Classical Music.” This book has
grown from my attempts to teach the basic principles of psychological 
measurement to people who do not consider these principles to be of 
central importance in their life and work. We usually find, for example, 
that first-year psychology students do not wish to be mathematical statis- 
ticians. They realize, however, that the courses in psychology they intend 
to take require some knowledge of quantitative research methods and test 
techniques. Similarly, most teachers and counselors do not see themselves 
as test experts, but they know that it is important for them to be able to 
interpret correctly the information about individual students they get from
standardized tests. With the enormous increase in the use of psychological 
tests in many areas of American life, more and more persons need the 
same kind of limited but basic knowledge—parents, businessmen, and 
readers of popular magazines, to mention just a few.


In my attempt to meet this need, I have found it useful to distinguish
between producer and consumer knowledge. I have deliberately omitted or
de-emphasized many concepts and methods that are essential in carrying
out research or constructing standardized tests. And I have deliberately
stressed the concepts one needs in order to read a journal article, choose
a test to serve a particular purpose, or interpret an individual’s score.
Needless to say, such a distinction cannot be made sharply, and perhaps
no two psychologists would make it in precisely the same way.

My primary aim has been to make the basic ideas clear. I am aware
that this purpose has led to what some readers may consider to be oversimplification
of definitions and computational examples. But I am assuming
that a student who really comprehends the underlying themes will be
able to assimilate the many variations without too much difficulty. Obviously,
this little book is not intended as a substitute for thorough courses in statis- 
tics and psychological testing. On the contrary, it is my hope that it will en- 
courage many students to continue their exploration of this extensive and fascinating 
terrain. 


Leona E. Tyler 
The Nature and Function of Measurement in Psychology


NEED FOR QUANTIFICATION
One of the principal distinctions between present-day 
scientific psychology and the philosophical psychology of 
times past or the everyday psychology of the man in the 
street is an emphasis on quantity as well as quality, on 
numbers as well as words. To the beginning student in 
psychology this fact often comes as an unwelcome sur- 
prise. Examining such unfamiliar quantities as means, 
averages, standard deviations, and frequency distribu- 
tions; weighing probabilities; making difficult decisions 
among uncertain conclusions—these kinds of mental ac- 
tivity do not fit in with his preconceptions about 
psychology or with his hopes about what he may learn from it. He may feel
that in having to come to grips with them he is being asked to make a long detour
over a rough and dusty road instead of proceeding directly to his goal:
a working knowledge of human nature.

But the fact is that quantitative thinking is an essential rather than a 
peripheral feature of psychology today. The progress made during the last 
century would, indeed, have been impossible without it. In the first place, 
quantitative methods permit us to draw precise conclusions from experiments. 
In conducting an experiment, what we do is to apply some special procedure 
to animal or human subjects, and note its effects. But it is apparent to any 
careful observer that reactions to stimulating situations vary a great deal 
from person to person. At a sudden clap of thunder, for instance, one person 
rushes to the window to catch sight of the lightning flashes, another buries 
her face in her hands. So, too, in experiments, reactions vary greatly. Without 
some way of answering the questions “How much?” and “How many?” the
outcomes of almost all experiments involving more than one subject thus
become contradictory and confused. And unless we study more than one subject,
we cannot expect to find out much about human nature.

In studying the learning process, for example, a psychologist may wish to
find out whether it is better to present assignments in elementary logic so that
students are prevented from making errors or so that they are permitted to
make errors and then correct them. (Such questions become important when
we undertake the practical task of constructing programs for the so-called
"teaching machines.") In setting up an experiment to answer this question
the psychologist would work out ways of quantifying the performance of his
subjects in as many aspects as possible. Anticipating that they will not all
react in the same way to the experimental conditions, he will not be surprised
if some persons score higher under Condition A, some score higher under
Condition B, and some do equally well under both. And it may turn out that
Method A enables some subjects to learn very rapidly, but at the cost of
not retaining what they learn, whereas Method B may take longer but
facilitate retention. Clearly, the psychologist must have many complex
possibilities in mind as he plans the experiment.

So he would resort to different kinds of measurements to enable him to
evaluate these possibilities. Thus, he might measure first the time each
subject requires to reach a certain level of performance. He might give a test
of logical thinking immediately after the experimental procedure, and again
a month later. He might devise tests for other kinds of mental ability thought
to be related to logical reasoning, such as practical problem-solving or the
detection of errors in propaganda. For each set of scores, he would compare
the averages for Groups A and B, making use of statistical methods that
would permit him to come to a conclusion stated in terms of probabilities.


What he reports, then, would take this form: “In my sample Method B 
came out better, and there are less than five chances in a hundred that a 
difference of this size could have arisen in a sample from a population in 
which no true difference existed.” Without quantifying his data he could have 
drawn no conclusions at all; with quantification he can make a statement of 
probability upon which to base further decisions—either to proceed with one 
of the methods or to do another experiment. 
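
The kind of probability statement quoted above can be illustrated with a short Python sketch. It is only one possible way of arriving at such a statement: it uses a permutation test, a technique the text does not name, and the scores for the two groups are invented purely for illustration. The function name `chance_of_difference` is likewise hypothetical.

```python
import random


def chance_of_difference(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one appears when scores are dealt to two groups at random."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_b) / len(group_b) - sum(group_a) / n_a)
    pooled = list(group_a) + list(group_b)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign the scores to the groups by chance
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_b - mean_a) >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles


# Hypothetical retention-test scores, invented for this illustration only.
method_a = [12, 14, 11, 15, 13, 12, 14, 13]
method_b = [16, 18, 15, 17, 19, 16, 18, 17]

p = chance_of_difference(method_a, method_b)
print(f"Estimated chance of so large a difference with no true difference: {p:.4f}")
```

If the printed value is below 0.05, the result corresponds to the statement "less than five chances in a hundred." A psychologist of the period would more likely have reached the same kind of statement with a table-based test; the shuffling approach is used here only because it needs no tables.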

Thus measurements help psychologists make decisions about what their 
research means. They also facilitate decisions about individuals. Psycholo- 
gists began their efforts to construct mental tests because tools that would 
help make such decisions were needed. The ancestor of our present-day 
intelligence tests was Alfred Binet’s 1905 scale. Binet developed this test as 
a means for helping Parisian school authorities decide which children were 
really unable to profit by the regular school program, no matter how they
tried. Whatever can be said about the misuses of intelligence tests from
Binet’s time to our own—and there are many serious and valid criticisms that 
can be made—they have in countless instances enabled teachers to distin- 
guish between the dull and the lazy far more efficiently than they could other- 
wise have done. 

Let’s take another example. During World War II, all branches of the 
Armed Forces would probably have preferred to recruit men who were above 
average in intelligence and emotional stability, high in mechanical and 
clerical aptitudes, athletically skilled, and physically fit. Obviously, though,
there are very few such all-round men. The problem for military psychologists,
then, was first to find out what particular abilities and skills were
needed in particular military jobs. Next, they devised tests to measure these
abilities, and finally they assigned men to special programs on the basis
of their scores. If, for example, the work of a personnel clerk requires mainly
quick perception of details in written material, a knack that clerical aptitude
tests measure, the sensible procedure seemed to be to assign to the personnel
program men who rated high on clerical tests but not so high on mechanical
tests. The men who scored higher on mechanical than clerical tests would be
left for assignment to other training programs, such as that for airplane
mechanics. Because there were many more or less distinct aptitudes or specialized
mental abilities to be measured and millions of recruits to be assigned,
the decision-making task the military psychologists faced was extremely
complex. They did not always succeed in assigning individuals where they
could make their most effective contribution, but the number of satisfactory
placements was far higher than it would have been had they relied on verbal
descriptions rather than numerical scores. Quantification was an indispensable
tool in the assignment of individuals to training programs.

Another kind of decision about persons has fostered a whole family of
tests and measurement techniques. As more and more people have become
interested in mental health and aware that psychological ills can be treated, the
demand for personality tests has grown. When a patient comes to a
hospital or clinic, it is helpful to determine at the outset what kind of person
he is and how he is likely to respond to treatment. Take the case of
Lawrence Gibbs. Should his condition be labeled "schizophrenia" or "[...]
reaction"? His psychiatrist may be looking for clues about whether Gibbs
will respond better to hospital treatment or to therapy at a clinic in his own
community. And if he enters the hospital, should he be assigned to a
group, or would some form of individual treatment carry a better prognosis?
These are complex decisions that can seldom be made with any great certainty,
but personality tests of several kinds can help in making them.

Measurement techniques are important also in situations where individuals
must make decisions about their own lives. In some ways, these decisions by a
person are more complex than decisions about a person. Many of the factors
entering into them cannot usually be quantified: highly personal factors such
as Henry's knowledge that his mother will be terribly disappointed if he does
not go into the ministry, or Mr. Merton’s feeling about a job that requires
him to be away from home for long periods of time. But the fact that there
are quantitative answers to some questions often simplifies such problems
enough so that satisfactory decisions can be reached. The score on one kind
of test, for example, will tell a college senior who is trying to decide whether
to go on to graduate school or to go into business how he compares in [...]
intellectual abilities with successful graduate students in his major area. The
score on another kind of test will enable him to judge how similar his interests
are to those of both businessmen and scholars in his field. Day after day, in
high school and college counseling offices, psychological measurements contribute
to such decisions.


Thus, testing techniques and the statistical methods that go with them have
been woven into the whole fabric of general and applied psychology. We shall
now examine in more detail the ideas and the tools that are used. Real skill
in the use of these methods requires a long apprenticeship, but a general
understanding of the basic principles is within the reach of everyone who applies
the findings of modern psychology to himself and to those around him, in
short, anyone, as parent, teacher, businessman, or just plain citizen.


LEVELS OF MEASUREMENT


Measurement, as psychologists use the term, covers a wide range of activities.
The only thing they all have in common is the use of numbers. The
most general definition of measurement we can formulate is simply that measurement
means the assignment of numerals according to rules.

These rules are not always as restrictive as persons with limited mathe- 
matical knowledge often think they are. Unless we have given the matter a 
good deal of thought, we are likely to assume that all the operations of elemen- 
tary arithmetic—addition, subtraction, multiplication, and division—ought 
to be applicable to all measuring systems. Thus, our first reaction may be to 
conclude that measurement is impossible unless some arithmetical operation 
can obviously be applied. Take the whole field of intelligence testing, for example.
Ever since these tests first came into use, thoughtful persons have
pointed out that it really makes no sense to say that a boy with an IQ of 150
is twice as bright as a boy with an IQ of 75. That the boys are very different
in their response to situations calling for abstract reasoning is true, but there
is no justification to be found anywhere in their behavior for expressing this
difference as a quotient of two, a fraction of one-half, or a ratio of two to one.
It is just not meaningful to divide one IQ by another.

Once psychologists became aware of this peculiarity in some of the numbers
they were using, they realized that there were parallel situations outside
psychology. Take temperature, for example, as we ordinarily measure it on
our Fahrenheit or Centigrade scales. If the temperature drops from 80° F
during the day to 40° F during the night, we do not say that it was half as
warm at midnight as it was at noon. Because zero in the Fahrenheit system is
an arbitrary figure that does not really mean “no warmth at all,” the number
of degrees above zero cannot be handled in the same way that a measurement
of height or weight can. Clearly, measurements of intelligence are more like
those of temperature than those of height.

The general formulation of different kinds, or levels, of measurement that
has been most useful to psychologists is the one set up by S. S. Stevens.*
According to this system, we can divide the possible ways of assigning numerals
into four types. Each of these varieties of measurement has rules and
restrictions of its own. And, for each, certain statistical procedures are
appropriate.

* See the chapter “Mathematics, measurement, and psychophysics” in his Handbook of
Experimental Psychology (New York: Wiley, 1951).

The Stevens system, as shown in Figure 1 on page 8, arranges these levels of
measurement according to the extent to which familiar arithmetical procedures are
applicable. The first, or lowest, in the series is the nominal scale. Where this
kind of scale applies, numbers are assigned only to identify the categories
to which individual persons or things belong. These can be individual
classifications, like placing numbers on football jerseys to identify the players.
Or they can be labels that apply to groups of persons, as when men are given
LEVEL IV: RATIO SCALES
Each number can be thought of as a distance measured from zero.
LIMITATIONS: There are no limitations. All arithmetical operations and all
statistical techniques are permissible.

LEVEL III: INTERVAL SCALES
The intervals or distances between each number and the next are equal, but it
is not known how far any of them is from zero.
LIMITATIONS: In addition to procedures listed below, addition and subtraction
and statistical techniques based on these arithmetical operations are
permissible. Multiplication and division are not permissible.

LEVEL II: ORDINAL SCALES
Numbers indicate rank or order.
LIMITATIONS: In addition to procedures below, ranking methods and other
statistical techniques based on interpretations of "greater than" or "less
than" are permissible.

LEVEL I: NOMINAL SCALES
Numbers are used to name, identify, or classify.
LIMITATIONS: The only permissible arithmetical procedures are counting and the
statistical techniques based on counting.

Figure 1. The levels of measurement.


a code number of 1, women a code number of 2. The only arithmetical operation
applicable to nominal scales is counting, the mere enumeration of individuals
in each class. The identifying numerals themselves can never be
added, subtracted, multiplied, or divided.

The next level of measurement is the ordinal scale, in which numbers indicate
rank or order. [...] The percentile scores often used in reporting test results
to students, since they too are a kind of ranking, also constitute ordinal
measurement. The common arithmetical operations (addition, subtraction,
multiplication, and division) cannot be legitimately used with ordinal scales,
but statistical procedures based on ranks are appropriate. It is possible, for
example, to determine whether the ranks of a group of children on popularity
are related to their ranks on dependability.
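
As a rough illustration of a procedure based on ranks, the following Python sketch computes Spearman's rank correlation. The text does not name this coefficient, so its use here is an assumption. The ranks for six children are hypothetical, and the simple formula shown applies only when there are no tied ranks.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman rank correlation for two lists of untied ranks 1..n."""
    n = len(ranks_x)
    if n != len(ranks_y) or n < 2:
        raise ValueError("need two rank lists of equal length, at least 2 long")
    # Only the differences between paired ranks enter the formula.
    sum_squared_diffs = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_squared_diffs / (n * (n * n - 1))


# Hypothetical ranks of six children, invented for illustration.
popularity = [1, 2, 3, 4, 5, 6]
dependability = [2, 1, 4, 3, 6, 5]

rho = spearman_rho(popularity, dependability)
print(f"rank correlation: {rho:.2f}")
```

A value near +1 means the two rankings agree closely, a value near -1 means they run in opposite directions, and a value near 0 means they are unrelated.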

The third level of measurement is the interval scale. What distinguishes
interval from ordinal measurement is that it permits us to state just how far
apart two things or persons are. Most school tests are of this type. Although
there is some question about their measurement properties, teachers usually
regard the scores on course examinations as interval scales. They consider it
legitimate to compare the scores two students make and to tell Jerry that his
score is 16 points lower than Bill's. But interval scales do have one important
limitation. They have no real zero point. It is true that a student may
occasionally come out with a score of zero on a test of trigonometry or English
literature. But this does not mean that he has no knowledge whatever of the
subject. Although the purpose of the test is to measure knowledge of subject
matter, it is not really necessary to define what “zero knowledge” means in
order to do so, since we will be using the test mainly to compare individuals
with one another.
Though we may add and subtract scores on interval scales, one arithmetical
operation is never legitimate, namely, to divide one score by another, since
division presupposes the existence of an exact zero point. Why not? Consider
for a moment an examination in which Louise scores 80 and Marie 40. Now
suppose that in writing this examination, the instructor had happened to include
10 other questions easy enough for both girls to answer correctly. In
this case Louise would have scored 90, Marie 50. The difference between
their scores would be 40 points in either case, but the quotient of the two
scores would not be the same. Instead of 2 (80 ÷ 40), it would be 1.8 (90 ÷
50). With any particular test, then, we have no way of finding out whether one
person's knowledge is twice as great, three times as great, or one-and-a-half
times as great as that of another person. But if we can assume that each question
is an equally good indicator of knowledge of the field, we do not violate
any principles of mathematics or logic when we subtract one score from
another or when we add a number of scores together and take an average.
The statistical methods used to translate raw scores on tests into various
kinds of derived scores, methods we shall be considering in a later chapter
(see pp. 35-38), rest on addition and subtraction exclusively.
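
The arithmetic of the Louise and Marie example can be checked with a few lines of Python. The sketch below also repeats the check for the earlier temperature example by converting 80° F and 40° F to Centigrade. The function names are hypothetical, chosen only for this illustration.

```python
def difference_and_quotient(score_a, score_b):
    """Return the difference and the quotient of two scores."""
    return score_a - score_b, score_a / score_b


def fahrenheit_to_centigrade(degrees_f):
    """Convert a Fahrenheit reading to Centigrade."""
    return (degrees_f - 32) * 5 / 9


# Louise and Marie, before and after ten easy questions are added for both.
diff_before, quot_before = difference_and_quotient(80, 40)
diff_after, quot_after = difference_and_quotient(80 + 10, 40 + 10)
print(diff_before, diff_after)   # the difference is unchanged by the shift
print(quot_before, quot_after)   # the quotient is not

# The same two temperatures give different "ratios" on different scales.
ratio_f = 80 / 40
ratio_c = fahrenheit_to_centigrade(80) / fahrenheit_to_centigrade(40)
print(ratio_f, round(ratio_c, 2))
```

Moving the zero point leaves every difference intact but changes every quotient, which is why subtraction is meaningful on an interval scale and division is not.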
The fourth and highest level of measurement is the ratio scale, in which all
arithmetical operations can be used: addition, subtraction, multiplication,
and division. Ratio scales have all the characteristics of interval scales,
with the additional advantage of a true zero point. We are probably more
familiar with such scales than with any of the other types because the common
physical dimensions (height, weight, volume) can be measured in this
way. On the scales of an honest butcher, zero means that no meat has
been placed on the pan, and when we buy from him we can say confidently
that a six-pound roast is four pounds heavier than a two-pound roast. It
also makes sense to say that it is three times as heavy. The name ratio scale
signifies that we can divide one number by another or express the two as a
ratio.

Many of the measurements we make in psychology fail to qualify as ratio
scales, but a few of them do. These few usually occur where it is possible to
measure a mental characteristic in physical units of some sort. When we
measure reaction time, for example, we use the customary time units, seconds
and fractions of a second. If we are interested in determining how quickly
would-be automobile drivers can step on a brake pedal in response to a
light, we can do so with a ratio scale. Thus, if it takes John five-tenths of a
second and Bill only three-tenths of a second, we can, if we like, make a
ratio of these two scores to describe how much quicker Bill is than John. Still,
even when we are working with a ratio scale, as in this instance, we may not
wish to make ratios or divide one number by another. For all the methods
permissible at lower levels of measurement can be used at a higher one. In
practical applications of studies of reaction time, for instance, what an
investigator would probably want to know about both John and Bill is how they
compare with norms, or standards, for adequate drivers. Such comparisons
involve only subtraction, not ratio-making.

Most of the scales psychologists have worked with are of the ordinal and
interval varieties. If we recognize this and observe the limitations that go with
such measurements, we reach sound conclusions; if we do not,
errors creep into our reasoning. Without some [...]
and measurements that have [...]
statement may be hard to grasp.
As we have said, [...]
limits the use of [...]

[...] college [...] him. Clearly, 60 is above average, and 40 is
below average. To help the boys reach correct conclusions about these
scores, the counselor must be aware that the units on
a percentile scale are large for very low or very high scores, but small for
those near the average. Because of this fact, scores of 60 and 40 are both very
close to 50. Any conclusion that Sheldon should go to college but that Jeff
should not is completely unwarranted. Both boys have an average
[...]. What this means for practical purposes is that
a person with little knowledge of statistics must be very cautious in
drawing conclusions about the meaning of differences expressed in percentiles.
If he knows something about ordinal scales he will at least be aware of the
need for caution.
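
The unequal size of percentile units can be made concrete with a short Python sketch. It rests on an assumption the passage does not state outright, namely that the underlying scores follow a normal distribution; under that assumption each percentile can be translated into a distance from the average measured in standard deviations. The function name `percentile_to_z` is hypothetical.

```python
from statistics import NormalDist

# Assumption: the underlying scores are normally distributed.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def percentile_to_z(percentile):
    """Distance from the average, in standard deviations, for a percentile
    rank strictly between 0 and 100."""
    return standard_normal.inv_cdf(percentile / 100)


# Two gaps of 20 percentile points each, one near the average, one in the tail.
gap_near_average = percentile_to_z(60) - percentile_to_z(40)
gap_in_upper_tail = percentile_to_z(99) - percentile_to_z(79)

print(f"40th to 60th percentile: {gap_near_average:.2f} standard deviations")
print(f"79th to 99th percentile: {gap_in_upper_tail:.2f} standard deviations")
```

Under the normality assumption, the same 20-point difference in percentiles covers roughly three times as much of the score scale in the tail as it does near the average, which is the reason two boys at the 40th and 60th percentiles are much more alike than their numbers suggest.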
With this understanding that numbers do not always mean
what they seem to mean, let us turn to some of the statistical ideas we must
grasp if we are to use psychological measurements intelligently.
Basic Statistics


ORDER IN VARIATION 
At the beginning of the previous chapter, we raised the
problem of variation from person to person. More than a
century ago, Lambert A. Quetelet, a Belgian mathematician,
discovered a kind of order in individual variations
that impressed him deeply. Quetelet worked mainly with
census data and measures of physical characteristics, but
it was not long before Sir Francis Galton, the brilliant
English scientist, was applying Quetelet’s methods, and
others he himself developed, to measurement of characteristics
like [...] “acuteness of vision,” traits we would now
classify as psychological.


Figure 2. Chest measurements of 5738 soldiers. Horizontal axis: chest measure
in inches; vertical axis: number per ten thousand. (From Galton, F. Hereditary
genius: an inquiry into its laws. Appleton, 1870. Horizon Press, 1952, chapter 3.)


The method these early workers hit upon, one that has been in constant
use ever since, was to obtain measurements on a large group of individuals,
and then arrange the results in order from lowest to highest in what they
called a "frequency distribution." In these distributions they found the same
pattern showing up again and again, a pattern that can be illustrated by Galton's
measurements of the chest girths of more than 5000 soldiers, shown in Figure 2.
The result, when a distribution is pictured in a graph, is a bell-shaped
curve. A large number of cases clusters near the middle of the distribution,
forming an average, or central tendency. The farther from this average a
measurement lies, the smaller the proportion of the total group to whom it
applies. Thus there are many men with chest measurements of 39, 40, and
41 inches but only a few with as low as 33 or as high as 48. Quetelet, Galton,
and other nineteenth-century workers found such a beautiful regularity in
the distributions of trait after trait that they came to believe there was some
universal law governing human differences.
This bell-shaped curve was already familiar to mathematicians,
who had discovered it as they studied the outcomes of games of chance
(coin-tossing, dice-throwing, and the like) and had worked out the mathematical
equation for the curve of the distribution obtained with an infinitely
large number of events, each determined by what we call “chance.” This
normal probability curve (or Gaussian curve, as it is called in honor of the
mathematician who formulated its equation) became one of the foundations
of modern statistics. To Quetelet and Galton, the equation for it seemed to
apply equally well to the distributions of measurements they had collected.
The word “normal,” as used in statistics, applies to this curve and carries
no connotation of “right” or “best.”
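
The way chance events pile up into a bell shape can be seen in a small simulation. The Python sketch below is an illustration only, not a reconstruction of any historical calculation: it tosses 20 simulated coins 2000 times and draws a crude bar for each possible number of heads. The numbers of coins and trials, the seed, and the function name are arbitrary choices made for this example.

```python
import random
from collections import Counter


def heads_counts(n_trials=2000, n_coins=20, seed=1):
    """Toss n_coins fair coins n_trials times; tally the number of heads."""
    rng = random.Random(seed)
    return Counter(
        sum(rng.random() < 0.5 for _ in range(n_coins))
        for _ in range(n_trials)
    )


counts = heads_counts()
for heads in range(21):
    # One '#' for every ten trials that produced this many heads.
    print(f"{heads:2d} {'#' * (counts[heads] // 10)}")
```

Most trials give a number of heads close to 10, and the bars shrink steadily toward either extreme, the same clustering around a central value that the early workers found in their measurements.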

As time has passed, it has become clear that human measurements,
especially measurements of psychological traits, do not always yield a Gaussian
distribution. Some distributions are skewed, meaning that there are either
too many high scores or too many low scores in the group to produce the
symmetry of the normal distribution curve; others are too peaked or stretched
out to fit the normal curve equation. These require special equations. But
since statistical methods based on normal distributions are easier to use and
interpret than those based on less common mathematical formulations, test
makers often make a special effort to collect items that will result in a normal
distribution in the group for which a test is intended. Thus, if the group on
which preliminary norms are based turns out to have too many low scores, it
is possible to take out some of the harder test questions and replace them with
easier ones. If there is not enough spread, items can be added that will produce
it. For some testing purposes, non-normal distributions are
preferable. In short, methods for analyzing the characteristics
of the individual items in a test now make it possible to select items in a
way that will produce any desired distribution. The majority of
well-standardized tests yield normal distributions when large groups are tested.
Figure 3 shows such a distribution for a test that was for years the
leading measure of children’s intelligence. While there are a few minor
irregularities, none is great enough to constitute a serious departure from the
normal form. There are large numbers of IQ's in the neighborhood of 100,
and very few at the extremes of 55 and 155. (A very small proportion of the
cases fall lower and higher than the arbitrary end points shown here.)

Figure 3. Distribution of IQ's on form L of the Stanford-Binet test for ages
2½ to 18. (From McNemar, Q. The revision of the Stanford-Binet Scale.
Boston: Houghton Mifflin, 1942.)


MEASURES OF CENTRAL TENDENCY


In summarizing a distribution of measurements for a group, we must always
describe in some way the [...] characteristics. First, we must
[...] by computing what we ordinarily call an
average, what statisticians call the arithmetic mean. To get this figure,
we add all the scores together and divide by the number of cases.
There are other kinds of averages, however, that are preferable for some
purposes: the median and the mode. The median is simply the middle
score in a distribution, or the number that would represent a point between
the two middle ones. The mode is the most common score, the one made by
the most persons. In Table 1 on page 16, we see what the mean, median, and
mode would be for one set of figures. We can also see why we might prefer one
to the others. A parent who wished to pay his child for allowance an amount
that conformed to the central tendency of the group might well be dissatisfied
with the arithmetic mean as an average here. This is why: The two boys in the
group who receive $1.00 allowances are not at all typical. Adding these figures
in with the others inflates the total so that the mean, 29¢, is too high to be
really typical of the group as a whole. The median of 22½¢ seems fairer
in this case, but the most convenient index of central tendency here is the
mode of 25¢, the amount received by the greatest number of boys.

In a distribution that is exactly normal, the mean, median, and mode are
identical, so it makes no difference which one we use as a measure of central
tendency. Since few distributions turn out to be exactly normal, however, we
must usually decide which is the most convenient and meaningful. Often the
[...] of the scores besides just [...]
TABLE 1
Central Tendency Determinations for a Distribution
of Weekly Allowances Received by 20 Boys in a School Class

Boy    Allowance        Boy    Allowance
 1      $0.10           11      $0.25
 2       0.10           12       0.25
 3       0.10           13       0.25
 4       0.15           14       0.25
 5       0.15           15       0.25
 6       0.15           16       0.30
 7       0.20           17       0.30
 8       0.20           18       0.35
 9       0.20           19       1.00
10       0.20           20       1.00

Mean = $5.75 ÷ 20 = $0.2875, or $0.29

Median = Between $0.20 and $0.25, or $0.23

Mode = $0.25
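
The three determinations in Table 1 can be checked with a short sketch. It is
only an illustration, not part of the original table: the allowances are
entered in cents, the value for boy 18 (35 cents) is an assumption inferred
from the printed total of $5.75, and the helper name central_tendency is
hypothetical.

```python
from statistics import mean, median, multimode

# Weekly allowances in cents for the 20 boys of Table 1.
# Assumption: boy 18 receives 35 cents, the value implied by the $5.75 total.
allowances = [10] * 3 + [15] * 3 + [20] * 4 + [25] * 5 + [30] * 2 + [35] + [100] * 2


def central_tendency(values):
    """Return the mean, median, and mode(s) of a list of scores."""
    return {
        "mean": mean(values),         # sum of scores divided by number of cases
        "median": median(values),     # midpoint between the two middle scores
        "modes": multimode(values),   # most frequent score(s)
    }


summary = central_tendency(allowances)
print(summary)
```

The two $1.00 allowances pull the mean above the median and the mode, which is
the reason the text gives for preferring the mode here.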


MEASURES OF VARIABILITY 


Basic 
Statistics 


The second kind of index we need in order to describe a distribution of
scores is some measure of variability. How much of a spread is there in the
distribution? It obviously makes a difference to a schoolteacher, for example,
whether the range of reading scores in her class is from 50 to 150 or from
90 to 110. In both classes, the average score is about 100, but the first group
would require much more diversification of reading materials than the second
one would. The simplest way to describe this variability is to subtract the
lowest score from the highest one. This gives us the index called the range. In
the example just given, the range in the first classroom is 100 points, in the
second 20 points.


A more satisfactory index of variability in a group, one that does not unduly
emphasize extreme cases, is the standard deviation. The formula is

\[ SD = \sqrt{\frac{\Sigma d^{2}}{N}} \]

(where d stands for "deviation from the mean," N, of course, stands for the
number of persons in the group, and the Greek symbol Σ stands for "sum," thus
instructing the reader to add the numbers to which it applies, in this case all
squared deviations). Table 2 shows how it is computed. First we obtain the
mean, or average, score in the customary way. Then we subtract this figure from
each of the individual scores. Next we


TABLE 2
Computation of the Standard Deviation
for a Distribution of the Scores
Made by 11 Children on a Spelling Test

                     Deviation     Squared
Child      Score     from Mean     Deviation
Elaine      20          +6            36
Martha      18          +4            16
Bill        15          +1             1
Jim         15          +1             1
Edna        14           0             0
Harry       14           0             0
Marie       14           0             0
Joe         13          -1             1
Lucy        13          -1             1
John        10          -4            16
Grant        8          -6            36
Total      154                       108

Mean = 154 ÷ 11 = 14
Variance = 108 ÷ 11 = 9.82
Standard deviation = √9.82 = 3.1


square each of these deviations. To find out what the average of these squared
deviations is, we divide their total by the number of cases, which gives us what
statisticians call the variance of the distribution. To obtain the standard
deviation we take the square root of the variance.

It is readily apparent that the standard deviation does not have the defect
we noted in the range: its magnitude does not depend greatly on single
extreme scores. On the contrary, each score in the distribution contributes to
the standard deviation. In distributions where scores cluster quite closely
around the mean, the standard deviation will naturally be small. In
distributions where scores spread out a good deal on both sides of the mean, the
standard deviation will be large.

Where the distribution of scores is normal, a person who knows the mean
and standard deviation does not need to see the whole distribution in order
to grasp its major characteristics. The use of means and standard deviations
thus furnishes us with a valuable shorthand way of comparing groups and
individuals within groups. We will explain in more detail in a later section
how we make such comparisons.
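
The steps of Table 2 can be retraced in a few lines. This is a minimal sketch
that follows the descriptive formula given in the text (division by N, not
N - 1); the function name describe_spread is hypothetical.

```python
from math import sqrt

# Spelling scores of the 11 children in Table 2.
spelling = [20, 18, 15, 15, 14, 14, 14, 13, 13, 10, 8]


def describe_spread(scores):
    """Range, variance, and standard deviation, dividing by N as in Table 2."""
    n = len(scores)
    mean = sum(scores) / n
    deviations = [score - mean for score in scores]   # d = score minus mean
    sum_squared = sum(d * d for d in deviations)      # sum of squared deviations
    variance = sum_squared / n                        # average squared deviation
    return {
        "range": max(scores) - min(scores),
        "mean": mean,
        "sum_squared": sum_squared,
        "variance": variance,
        "sd": sqrt(variance),
    }


spread = describe_spread(spelling)
print(spread)
```

Rounded as in the table, the variance is 9.82 and the standard deviation 3.1.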


MEASURES OF RELATIONSHIP 


Besides measures of central tendency and variability, one other statistical
device is indispensable in dealing with measurements of human traits. When,
as is often the case, we must relate one score to another, we use the
correlation coefficient. Although there are many ways of computing correlations,
one technique, which yields an index called the product-moment coefficient,
is the most common.

It is also possible, and helpful, to indicate graphically, by a scatter
diagram, the extent to which two measures are related. So, before we delve
into the computation of the product-moment coefficient, let us look first at
Figure 4, a scatter diagram that plots the correlation between the spelling
scores of the 11 children in Table 2 and their arithmetic scores. On the graph
one mark indicates each child's scores in both subjects. Martha's scores, for
example, are on the line extending upward from a spelling score of 18 and
on the line extending to the right from an arithmetic score of 25. In a diagram
like this, we can judge how closely related two measurements are by seeing
how close the entries lie to a single line that might be drawn from the
lower left-hand corner to the upper right-hand corner. If all entries are on
such a line, the relationship is perfect. In this case, it would mean that each
child stood exactly as far above or below the group average in arithmetic as in
spelling. This is not quite the case, however, for the children in Figure 4.
There is a strong general trend for scores to fall near the diagonal line, but
they are not exactly on it. A scatter diagram gives a useful but rough picture
of such a relationship. We often need to refine our understanding.

Figure 4. Scatter diagram showing relationship between spelling and
arithmetic scores made by 11 children.

The correlation coefficient (r) is a precise mathematical expression
of the relationship between two such sets of scores. Its formula is:

\[ r = \frac{\Sigma (d_x)(d_y)}{N (SD_x)(SD_y)} \]

(In this formula the letter d, again, stands for deviation from the mean,
N stands for the number of cases, SD_x stands for the standard deviation of
the first set of scores, and SD_y stands for the standard deviation of the
second set of scores.) Table 3 on page 20 shows how we compute it. First we find
the mean and standard deviation for each set of scores. (These are computed for
spelling scores in Table 2; the same procedure is followed for the arithmetic
scores.) We then multiply each person's deviation from the group mean for
spelling by his deviation from the group mean for arithmetic. The algebraic
sum of these deviation products forms the numerator of the fraction in the
equation. We then proceed to obtain the denominator in the fraction by
multiplying the number of cases by the product of the two standard deviations.

If the relationship between two variables is perfect, r will be 1.00. For all
lesser degrees of relationship, it will turn out to be a decimal number less
than 1.00. In our example, r = .83. We can understand why this is as high
as it is, and also why it is not 1.00, if we examine the deviations. The children
who are above average on the spelling test are also above average on the
arithmetic test, except for Bill. The children who are below average in spelling
are also below average in arithmetic, except for Joe. Elaine is a little higher
in spelling than Martha is, while Martha is higher than Elaine in arithmetic.
Generally speaking, each of these children is just about as successful in
arithmetic as he is in spelling. It is precisely such correlations that a high
r (here .83) indicates. Just as the mean and standard deviation are
shorthand descriptions of one distribution, so a correlation coefficient provides
a shorthand description of the relationship between two.

Most of the correlations the psychologist runs into are lower than the
one we have been considering, and sometimes they are negative. A negative ...

TABLE 3
Computation of the Product-Moment
Correlation Coefficient for Scores
Made by 11 Children on Spelling and Arithmetic Tests

                                    Spelling      Arithmetic
                                    Deviation     Deviation     Product of
Child     Spelling    Arithmetic    from Mean     from Mean     Deviations
Elaine       20           20           +6            +1             +6
Martha       18           25           +4            +6            +24
Bill         15           19           +1             0              0
Jim          15           20           +1            +1             +1
Edna         14           19            0             0              0
Harry        14           20            0            +1              0
Marie        14           18            0            -1              0
Joe          13           19           -1             0              0
Lucy         13           18           -1            -1             +1
John         10           16           -4            -3            +12
Grant         8           15           -6            -4            +24
Total       154          209                                       +68

Spelling mean = 154 ÷ 11 = 14
Arithmetic mean = 209 ÷ 11 = 19
Spelling standard deviation (computed in Table 2) = 3.1
Arithmetic standard deviation (computed in manner shown in Table 2) = 2.4

Product-moment correlation: r = 68 ÷ (11 × 3.1 × 2.4) = .83
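
The computation in Table 3 can be sketched as follows. The function name
product_moment_r is hypothetical, and the sketch applies the formula from the
text with standard deviations obtained by dividing by N. One caution: the
printed .83 rests on standard deviations already rounded to 3.1 and 2.4.
Carrying the unrounded values through gives a somewhat lower figure, about .81.

```python
from math import sqrt

# Scores of the 11 children, in the order used in Tables 2 and 3.
spelling = [20, 18, 15, 15, 14, 14, 14, 13, 13, 10, 8]
arithmetic = [20, 25, 19, 20, 19, 20, 18, 19, 18, 16, 15]


def product_moment_r(xs, ys):
    """r = sum(dx * dy) / (N * SDx * SDy), with SDs computed by dividing by N."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]                   # deviations, first set
    dy = [y - mean_y for y in ys]                   # deviations, second set
    sum_products = sum(a * b for a, b in zip(dx, dy))
    sd_x = sqrt(sum(a * a for a in dx) / n)
    sd_y = sqrt(sum(b * b for b in dy) / n)
    return sum_products, sd_x, sd_y, sum_products / (n * sd_x * sd_y)


sum_products, sd_x, sd_y, r_exact = product_moment_r(spelling, arithmetic)

# Same arithmetic as Table 3, using the rounded standard deviations.
r_table = sum_products / (11 * 3.1 * 2.4)

print(sum_products, round(sd_x, 1), round(sd_y, 1))
print(round(r_exact, 3), round(r_table, 2))
```

Either way the coefficient is high and positive, which is the point the text
draws from the example.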


... to one another, we should never have been able to find out what a
test measures or how accurately it measures anything at all. Just how
correlations are used for these purposes will be considered in the next chapter,
which deals with tests.
STATISTICAL SIGNIFICANCE AND CHANCE


So far we have considered only methods designed to describe a
distribution. Statistics also provides convenient techniques for drawing
conclusions that go beyond the group originally studied. The methods used for
this purpose are sometimes complex and difficult to understand, but anyone who
works with psychological measurements should have the basic reasoning behind
them within his grasp.

The fundamental idea is that any one group of persons or things we choose
to study constitutes a sample of a larger population. When we carry on
research, it is that general population about which we really wish to draw
conclusions rather than the particular sample we happen to test. Thus, the 11
children whose spelling and arithmetic scores we have been examining are
only a sample of a numerous group of children their age. Whether the
population of children we are interested in is in one town, one state, one
country, or the whole world, the principle is the same: These children
constitute a sample. We would like to infer from their scores something about
spelling and arithmetic performance in the population.

The question of how samples are related to populations has been studied
through both empirical procedures and mathematical reasoning. In the
practical, empirical kind of study, an investigator may shake dice again and
again, compute the mean of each ten throws, and analyze the differences among
these "samples" of ten from a "population" that consists of, say,
10,000 throws. Mathematical reasoning accomplishes the same purpose with
less effort. What we find in both cases is that the drawing of repeated samples
from the same population produces a distribution of whatever statistic we
are computing, a normal one if the population from which the samples were
drawn has a normal distribution. The mean and standard deviation of this
sampling distribution can be estimated from the information we have about
one sample alone. In Figure 5 on page 22 we see what the distribution of means
for samples of 25 would be for a population with a mean of 100 and a standard
deviation of 20. We need not go into the mathematical relationships or the
formulas used to make the computations. The only important point to be gleaned
from Figure 5 is that the variation of the means of samples of 25 is much
smaller than the variability of scores in the population from which samples
are drawn.

Figure 5. Hypothetical normal distribution of scores for an infinitely
large population in which the mean is 100 and the standard deviation is 20.
Also the hypothetical distribution of means that would be obtained if many
samples of 25 were drawn at random from such a population.

There is a special name for the standard deviation of the sampling
distribution, a standard deviation that indicates how much variability there is
in a statistic rather than in people or events. It is standard error. Since we
can compute any statistic we like in samples and in populations, there are many
kinds of standard error: standard error of a mean, standard error of a
difference between two means, standard error of a correlation coefficient,
standard error of a score. For whichever kind, the standard error is a way of
showing us what degree of variation, with regard to the statistical unit in
question, we can expect in different samples of the same population.
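
The point of Figure 5 can be illustrated with a small sketch that uses both
approaches mentioned above. It assumes the usual rule that the standard error
of a mean equals the population standard deviation divided by the square root
of the sample size, and it draws simulated samples from a normal population.
The function names, the seed, and the number of samples (2,000) are
illustrative choices, not values from the text.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def standard_error_of_mean(population_sd, sample_size):
    """Standard deviation of the sampling distribution of the mean."""
    return population_sd / sqrt(sample_size)


def simulate_sample_means(pop_mean, pop_sd, sample_size, n_samples, seed=0):
    """Empirical approach: draw many random samples and record each mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        mean([rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


# Population of Figure 5: mean 100, standard deviation 20, samples of 25.
theoretical_se = standard_error_of_mean(20, 25)
sample_means = simulate_sample_means(100, 20, 25, n_samples=2000)
observed_se = pstdev(sample_means)

print(theoretical_se)          # 20 / 5
print(round(observed_se, 2))   # close to the theoretical value
```

Under these assumptions the means of samples of 25 vary only about one-fifth
as much as the individual scores do.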

The concept of sampling distributions enables us to determine the proba- 
bility that any result we have obtained could have occurred by “chance.” 
Chance is simply a label for whatever unknown factors there are that cause 
samples to differ. Why does one throw of the dice turn up different numbers 
from another? Chance. Why do answers on a public opinion poll vary a little
from group to group, even when respondents are chosen in exactly the same 
way? Chance. 

To illustrate this reasoning, let us compare some scores to find out whether
the difference between two means is a matter of chance. Suppose that the
teacher of our 11 children adopts a new method whereby she hopes to make
them better spellers. After a month of intensive training under the new
method, the children take a second test known to be exactly equal in difficulty
to the first one. Their scores on the first and second tests are shown in Table
4. It is clear that the average for the class has gone up from 14 to 16. While
one child actually scored lower on the second test than on the first, and one
showed no change, the rest all improved to varying extents. But is this 2-point
difference really large enough so that we know that it could not have occurred
by chance if we had taken a second sample of these children's spelling without
giving them any special teaching at all? It is this sort of question that the t
statistic was designed to answer. What t is, is the ratio of the difference
between two means to the standard error of the difference. (There is, of
course, a possibility that the improvement resulted from what was learned by
taking the first test. To check for this would require a more complicated
research design. For the time being, we will ignore the possibility.) Before
considering what t means, let us see how it is computed.

First we determine the difference between the two scores for each child in
the group. Then we work out the mean and standard deviation of this
distribution of differences just as we do for single scores. (In one respect
this computation differs from the one shown in Table 2. In inferential, unlike
descriptive, statistics, for reasons it is not necessary to go into here, we
divide the total of the squared deviations by one less than the total number of
individuals in the group, in this case by 10 rather than 11.)

TABLE 4
Computation of the t Statistic
to Test Whether an Increase in Spelling Scores
Is Statistically Significant

                                              Deviation
           First     Second                   from Mean        Squared
Child      Score     Score     Difference     of Difference    Deviation
Elaine      20        23          +3              +1               1
Martha      18        20          +2               0               0
Bill        15        19          +4              +2               4
Jim         15        15           0              -2               4
Edna        14        11          -3              -5              25
Harry       14        16          +2               0               0
Marie       14        18          +4              +2               4
Joe         13        15          +2               0               0
Lucy        13        16          +3              +1               1
John        10        13          +3              +1               1
Grant        8        10          +2               0               0
Total      154       176          22                              40

Mean of first test = 154 ÷ 11 = 14
Mean of second test = 176 ÷ 11 = 16
Mean of differences = 22 ÷ 11 = 2
Standard deviation of differences = √(40/10) = √4.0 = 2.0
Standard error of difference = 2.0 ÷ √11 = 2.0 ÷ 3.3 = .6
t = 2 (difference mean) ÷ .6 (standard error of difference) = 3.33
P < .01


To obtain the standard error of the difference between two means (the standard
deviation of the sampling distribution for such differences), we divide the
standard deviation of the differences by the square root of the total number of
cases. (Again, we need not go into the mathematical reasons for this
procedure.) The t ratio consists of the mean of the differences divided by the
standard error of the differences. In this case, it turns out to be 3.33. From
tables of t, available in all standard statistics texts, we learn that this is
statistically significant at the .01 level (represented as P < .01 in Table 4).
This last phrase, translated into the language of samples and populations
and applied to our example, means that no more than once in a hundred times
would a group of children like ours give us the results we obtained on the two
tests if chance factors alone were operating. We can conclude with considerable
confidence, then, that the special teaching method was effective. It may not,
of course, be effective for the reasons the teacher thinks it is. Further study
of the method under various conditions, as well as comparisons with children
taught in other ways, would probably be necessary before the teacher could
know just why it worked.

There are other versions of the t formula designed to be used with
different kinds of data. But whatever technique is used, the result is a
probability statement like the one in the example. These probability
statements are what the term statistical significance refers to. A statistically
significant difference is one that produces a t with a small probability. This
means that it is not very likely that the difference in question would occur in
chance samples in which the members had not been chosen in a special way or
given special treatment. A statistically significant correlation is one that
would have been unlikely to turn up, had we simply put down pairs of numbers at
random and gone through the motions of computing r. The lower the probability
level, then, the more certain we can be that our results represent something
other than chance.

Anyone who reads research reports will repeatedly encounter such phrases as
"significant at the .05 level." This is simply a way of stating what we have
been talking about, of saying that results like those reported would have been
expected by chance not more than five times out of a hundred. It is customary
not to consider results that give probabilities larger than .05 to be
"statistically significant," but the level varies somewhat in accordance with
what the investigator is trying to find out.

The development of statistical techniques that permit research workers to
make inferences about whole populations from samples has enabled us to
increase our knowledge far more rapidly than we could otherwise have done.
Besides that, however, an understanding of these techniques, at least in a
general way, is useful to any intelligent citizen who is not himself a research
worker, in that it helps him to evaluate what he hears and reads about people
and products, to draw some conclusion as to whether figures presented as
evidence for the superiority of a new tooth paste, for example, really prove
anything. Intricate as the reasoning included in statistical inferences is, it
is well worth the effort to master it.
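
The computation laid out in Table 4 can be sketched in the same way. The
function name paired_t is hypothetical. The critical value of 3.169 (two-tailed
.01 level, 10 degrees of freedom) is taken from standard t tables and is not
given in the text. Because the table rounds the standard error to .6, it
arrives at 3.33; without that rounding the ratio is about 3.32, and the
conclusion is the same.

```python
from math import sqrt

# First and second spelling scores, in the order of Table 4.
first = [20, 18, 15, 15, 14, 14, 14, 13, 13, 10, 8]
second = [23, 20, 19, 15, 11, 16, 18, 15, 16, 13, 10]

# Assumption: two-tailed .01 critical value for 10 degrees of freedom,
# as listed in standard t tables (not stated in the text).
CRITICAL_T_01_DF10 = 3.169


def paired_t(before, after):
    """t = mean difference / standard error of the difference."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    sum_squared = sum((d - mean_diff) ** 2 for d in diffs)
    sd_diff = sqrt(sum_squared / (n - 1))   # inferential: divide by N - 1
    se_diff = sd_diff / sqrt(n)             # standard error of the difference
    return mean_diff, sd_diff, se_diff, mean_diff / se_diff


mean_diff, sd_diff, se_diff, t = paired_t(first, second)
print(mean_diff, sd_diff, round(se_diff, 2), round(t, 2))
print(t > CRITICAL_T_01_DF10)   # True: significant at the .01 level
```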
So far we have not considered whether there is any distinction between the
terms test and measurement. Although their meanings overlap, they are not quite
synonyms. The latter word applies in many areas of psychological research where
the former would not be appropriate. For example: Experimenters who study
sensation, perception, and judgment make extensive use of psychophysics, that
is, the measurement of physical magnitudes corresponding to psychological
magnitudes (such things as how bright a light looks, or how loud a tone
sounds). If the question under examination is, say, "What are the upper and
lower limits of human hearing?" what they measure are vibration rates. The
physical measurements are thus used to answer a psychological question.

A measurement is called a test only if it is used primarily to find out
something about an individual rather than to answer a general question.
Measurements of pitch thresholds can, of course, be used as tests. But more
typically a test consists of questions or tasks presented to the subject, and
the scores obtained are not expressed in physical units of any kind.

Thus, not all measurements are tests. But the reverse is also true: not
all tests are measurements. There are some personality tests, for example,
that do not produce scores. A psychologist may use such a test to help him
formulate a verbal description of a person. Measurement, at any of the
levels we distinguished in Chapter 1, need not be involved. In Chapter 1
measurement was defined as the assignment of numerals to things according to
rules. A test can be defined as a standardized situation designed to elicit a
sample of an individual's behavior. When that sample can be expressed as a
numerical score, either test or measurement is an appropriate term.

Thus, even though the overlap between the concepts is not complete, we can
still say that most tests are measuring techniques, and most psychological
measurements can be used as tests. We shall take up in this chapter some ideas
that apply particularly to tests.


HISTORICAL BACKGROUND 


Psychological
Tests 


Throughout the years since psychology got started as a scientific
enterprise (the date usually cited is the ... shape, and size we all
experience. They have set up experiments to study the learning process,
expressing their findings as laws of learning. They have studied human
development, by comparing one age group with another, and have established
norms of behavior for each age.

An American who studied in Wundt's laboratory, James McKeen Cattell,
was particularly influential in the movement to utilize psychological
measurements as mental tests. It was he, in fact, who first used the term
"mental test," in 1890; but there were others who figured prominently in the
same movement during the closing years of the nineteenth century.

What these scientists were most eager to find was some quantitative way
of assessing general intelligence. They thought they could obtain an index
of intelligence if they could measure in combination in individuals all the
characteristics that were being measured separately in the experimental
laboratories: sensation, perception, attention, discrimination, speed of
reaction, and so on. The superior man, according to their reasoning, should be
the person who ranks high in all these qualities. Though there is nothing
obviously wrong with their reasoning, the attempt to measure intelligence in
this way failed. For when the psychologists analyzed their measurements, they
discovered that these traits were not closely correlated one to another.
Furthermore, the sum of the scores did not appear to be an index of general
intelligence. Poor students, for example, achieved just as high scores as good
students did. Clearly, a different approach to the problem of measuring
intelligence seemed to be required.

This new departure was soon made in France, when Alfred Binet started
from the premise that intelligence is an inherently complex characteristic,
not merely the sum of many simple traits. To measure it, he held, we must
find ways of evaluating how individuals deal with tasks that require
reasoning, judgment, and problem-solving. So over a period of years Binet tried
out, on his own children and on children of different ages in the Paris schools,
many kinds of tasks as potential tests of children's intelligence. At last in
1905, with the collaboration of Théodore Simon, he published the first real
intelligence scale, the ancestor of all our present tests.

Progress was rapid from then on. The Binet-Simon scale was adapted for
use in many countries. Tests suitable for adults were added to those designed
especially for children. Group tests were produced during World War I and
adapted soon afterward for use in schools and industries. Tests for many
abilities not so broad and general as intelligence were constructed. Attempts to
measure personality characteristics as well as abilities became more and more
common. By the middle of the 20th century, thousands of tests (good, bad,
and indifferent) were in print.

As a result of all this activity on the mental testing front, standards and
principles have been formulated. These now serve as convenient guides to those
who wish to construct new tests and to those who need to choose among tests
already in existence. Furthermore, in our test-conscious modern society it has
virtually become essential for an educated man or woman to know something
about these standards and principles. The teacher who wishes to adapt
classroom instruction to the differing characteristics of the pupils,
the parent who wants to help his son or daughter make wise plans for education
and career, the businessman who hopes to avoid being duped by unscrupulous
test publishers and salesmen: all need a working knowledge of how
tests are built and how they should be judged.


THE MEANING OF VALIDITY 


The most important consideration is validity, which pertains to the
question: "What does this test measure?" Unless we have a fairly definite answer
to this question, any test will be useless in our attempts to understand
human beings, adults or children, ourselves or others. It may be worse
than useless, because if we act on the wrong assumption about what a person's
score means, we may steer him toward decisions that will lead to maladjustment
or costly mistakes.

Many people fail to recognize that the title of a test really tells us nothing
at all about what the test measures. Anyone can write a set of questions, and
he may be perfectly confident that they call for "reasoning" or "mechanical
aptitude" or "cognitive flexibility" on the part of the person who answers
them. But to find out what mental processes the testee must actually use to
answer these questions is a long and arduous task. A test Mr. Smith constructs
to measure "reasoning ability" may prove to be a test of middle-class ... A
"mechanical-aptitude" test may turn out to measure mainly general intelligence.
An "emotional-maturity" inventory may measure only what the testees
know about the social desirability of certain behavior. The first habit one
must acquire in evaluating the validity of tests is the habit of disregarding
their titles. The important question is not "What does the author call this
test?" but "What are scores on this test related to?"
At the outset of the testing movement, the accepted procedure was to define
first what one intended to measure and then collect evidence to show how
successful one had been. During those years, as we saw, most efforts were
directed toward the measurement of general intelligence. And although it
became apparent as the number of investigators increased that they were not all
defining this elusive term in exactly the same way, still, for practical
purposes they were all assuming that intelligence is the trait on which
differences of brightness and dullness in school are based. Thus they looked to
school situations for evidence about the validity of intelligence tests. How
close, they asked, is the relationship between the degree of school success
predicted from test scores and the degree of success actually achieved? It was
on the basis of the resulting evidence that the laboratory measurements of
discrimination, reaction time, and the like, were discarded, and the Binet-type
questions, calling for more complex responses, were retained in intelligence
tests.

Such measurements of a psychological trait in a real-life situation such as a
school are called criterion measurements. Correlations between test scores
and criterion measurements are called validity coefficients.

At first, then, validity meant the extent to which a test measures what
it purports to measure. To find out how valid a test was, one was expected to
correlate its scores with criterion measurements. But as time passed, it
became apparent that there were difficulties in both procedure and concept.
Unfortunately, it is impossible in most instances to find any one criterion that
will be an unambiguous indicator of a mental trait. Two psychologists
investigating the same trait (mechanical aptitude, for example) may decide to
use different criteria and so may achieve different results. Mr. Allen may think
that the best measures to rely on are the grades high school boys receive
in shop courses. Mr. Baker may think that the length of time it takes new
workers to learn a simple mechanical skill in a factory is the relevant
criterion. Now what if the test they both use correlates well with one of these
criteria and poorly with the other? How are we to say how valid the test is when
it gives two such different answers? Is it a test of mechanical aptitude or not?

Out of experiences like these the realization has grown that the validation
of a test is a long process rather than a single event. Only through studying
the correlation of a test with a variety of criteria can we understand just
what it measures. A series of research studies on the "mechanical-aptitude"
test, for example, may demonstrate that what it really measures is the ability
to carry out certain controlled, skilled movements and that it has nothing to do
with the ability to grasp complex relationships of mechanical parts. Thus, it
may correlate fairly highly with grades in wood shop but not with grades
in machine shop. It may select competent workers for one factory but not for
another. Thus instead of asking the old validity question, "To what extent does
this test measure what it purports to measure?" we are now more likely to
ask, "Just what is it that this test does measure?" Today, we know that we
must study the content of the test and examine many correlations with
different criteria in various groups before we can know the answer. And as we
do, our understanding of what the basic mental traits are and how they are
related to one another continues to grow and change along with our knowledge of
the individual tests themselves. It is not really necessary that a psychologist
formulate at the beginning a precise definition of the trait he hopes his test
will measure. If he has a general idea about the characteristic and its
relation to either theoretical concepts or practical situations,
his precision in defining
it will increase as he tries out his test in a series of separate research
studies.

Understanding what is involved in the validation of tests carries three
implications for test-users. One is that if a test is to be employed in making
decisions about individuals or groups, all the available evidence should be
studied before any attempt is made to interpret the scores. This means that
someone in the organization using the test must be capable of making such
a study. A "testing program" in schools or in industry can function effectively
only when it is run by a thoroughly competent person.

Another implication is that whenever possible a test to be used for
prediction or selection (as of job applicants) should be validated in the
situation in which it is to be employed. This, of course, requires a
research study on the spot before one begins to use the test routinely to decide
which applicants to hire. The relevant validity question to ask when such a
study is planned is not "Is the test valid?" but "Will this test add anything
to the validity of selection methods now being used?" A college that has been
applying definite standards of selection on the basis of high school grades may
find that a so-called "college aptitude" test has nothing to contribute to its
program. Another college, with a different program, which draws from a
different pool of applicants, may find that the same college aptitude test has
considerable validity. Not all situations permit this on-the-spot validation of
tests, however. But where it is possible, it is to be highly recommended. For
an example of a study of the predictive validity of the Iowa Tests of
Educational Development, made especially for high school teachers, see Table 5.
It shows, for instance, that tenth-graders who score 25 or 26 are likely to
succeed in college. Out of every 100 students with scores this high, 91 make
at least a 2.0 (C) average, 70 make a 2.5 (C+) average, and 40 make a
3.0 (B) average.

Finally, whether we wish to use tests in practical situations involving
individuals or in pure research aimed at increasing our theoretical knowledge
of individual differences, we should always remember that our ideas about
what the traits are as well as what the tests measure must change as new
evidence comes in. When a research worker fails to get the correlation he
expected between "anxiety" and "susceptibility to the stress of failure," he
may need to change his views about what is being measured by the anxiety
scale he has been using. In addition, he may have to modify his theory about
the way personality is affected by the threat of failure. Persons who use
tests a great deal, such as teachers in the public schools, should try to
keep abreast of new knowledge about these tests and also about the traits
they measure. Many present-day school teachers, for example, are making
decisions about children on the basis of outmoded concepts of what intelligence
is. What they need is not just more valid tests but a more penetrating
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TABLE 5 


Some Evidence about the Validity 


of the lowa Tests of Educational Development 
s of Ability to Do College Work 


as Measure: 
ITED 
Composite 
T Score Chances in 100 of Earning 
: esl token College GPA of Expected Range of College Grades 
= EE e c B s 


D 


"10th grade) ^20 25 a D 


29-30 97 86 62 

21-28 95 80 so — — 
25-26 91 70 40 i im — 
23-24 86 60 29 = m — 3 
21-22 78 49 19 — — 
19-20 60 38 13 —— — 

1-18 59 28 8 0 -—— —— 

15-16 48 19 4 ——_— — 


13-, 
14 38 12 


le 
na 27 7 1 —_ — — 
9 
10 18 4 
f — = 
E 12 2 
SS 7 1 p” — 


Most Probable Grade 


N = 365 "e i 


(For 
ITE —— dim ———] 
D Forms X-1, X-2, Y-1, Y-2) m T | 
68%: 

F n 

°F Students 95% 
and Geyer ad of Oregon 
poe of Actual Grades Fall Within These Ranges 


Fro, = 
9f 9,2 Carlso ; Universit 
Oregon Gon, J. S., and Fullmer, D. W., College Norms (Eugene, Oregon: University 


Counselin, 
g Center, 1959), p. 11. Psychological 
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s lore test 
knowledge of what the word intelligence means. We cannot really E 
H iditv Hu ] 
validity without examining at the same time the validity of our own ideas 


+. and seii 

Validity is the most important consideration in the construction and use
of all types of test. Next in order of importance is reliability. In explaining
what this is, we can start again with a question: How accurately or how
consistently does a test measure whatever it does measure?

When a person takes a test, many things may influence his score
apart from the ability or personality trait the test has been designed to
measure. If the testee is distracted or rebellious, he may not give as
good answers to questions as he would if he were concentrating on them.
This produces a score that is lower than it should be. On the other hand, it
may happen that he has encountered the night before, on a program
or in casual conversation, some of the specific questions contained in
the test. In this case he is likely to obtain a score that is a little higher than
his customary performance would entitle him to. These are two examples of
chance influences on test scores. There are innumerable others: reactions
to the examiner, temperature and ventilation in the examining room,
good or bad luck in making guesses about things one really does not know,
to mention just a few. They are all essentially unpredictable, and they do not
affect all testees in the same way or to the same degree.

What the test-maker must discover and communicate to the test user is
how inaccurate individual scores are likely to be because of such chance
factors. Having this information, the person who must interpret a score can
make the proper allowance for such inaccuracy. His conclusion about Billy
will then follow these lines: Billy is slightly above average in intelligence.
There is a 50-50 chance that his IQ lies somewhere between 105 and 115.
He will not conclude: Billy has an IQ of 110.

There are a number of ways to analyze the reliability of a test and report
the results of the analysis, but they all have the first step in common: the
administration of two versions of the same test to a group of people similar
in age or social characteristics to the people for whom the test was designed.
If we want to determine the reliability for eight-year-olds of a particular test
of intelligence, we may give a group of eight-year-olds both Form A and
Form B simultaneously. If only one form is available, we can make up
separate scoring keys for two halves of the test (usually the odd items on
one key, the even items on another) so that we obtain two scores for each
person rather than one. If we want to evaluate the reliability of a test of
dexterity for boys at a vocational school, our procedure will probably be to
administer the test to a group of boys twice, with a short interval between.
This again gives the examiner two scores for each subject.

There is no reason to suppose that all the miscellaneous chance factors
will influence both scores the same way. Therefore, we can reason that the
more alike the two scores for each person are, the more consistently, or
accurately, or reliably, the test is measuring some nonchance ability or
trait. But the more cases there are in a group of individuals who make very
different scores, the more we can suspect that the scores of all testees
depend to a large extent on chance factors.


As we saw in Chapter 2, the standard way of stating the degree of
resemblance (or difference) between two sets of scores is the correlation
coefficient. Since chance influences are always present to some extent, a
test-maker never finds perfect consistency between two sets of scores; the
correlation, that is, never turns out to be 1.00. But if the test is well
constructed and is administered carefully under good conditions, it is not too
much to expect that this reliability coefficient should be about .90. For some
purposes and in some situations, a test can still be used even though its
reliability is considerably lower than this. What is important is that anyone
who makes inferences about individuals or groups on the basis of the test
should know how reliable it is so that he can make the necessary allowances
for inaccuracies.
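
The odd-even procedure and the correlation between the two resulting sets of scores can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the answer sheets are invented, and the function names `split_half_scores` and `pearson_r` are hypothetical labels, not part of any published test manual.

```python
from math import sqrt


def split_half_scores(item_scores):
    """Return (odd_total, even_total) for one person's item scores.

    item_scores is a list of 0/1 marks in test order; items 1, 3, 5, ...
    are scored on one key and items 2, 4, 6, ... on the other.
    """
    odd_total = sum(item_scores[0::2])   # items 1, 3, 5, ...
    even_total = sum(item_scores[1::2])  # items 2, 4, 6, ...
    return odd_total, even_total


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical answer sheets: one row per person, 1 = correct, 0 = wrong.
answer_sheets = [
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
]

halves = [split_half_scores(sheet) for sheet in answer_sheets]
odd_scores = [odd for odd, _ in halves]
even_scores = [even for _, even in halves]

half_test_r = pearson_r(odd_scores, even_scores)
print(f"correlation between the two halves: {half_test_r:.2f}")
```

With only five invented people the coefficient means very little; the point is the two-scores-per-person layout, which is the same whether the scores come from two forms, two halves, or two administrations.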
Unfortunately, describing the reliability of a test by citing the correlation
between two versions of it has several disadvantages. The most serious is
that a coefficient turns out to be higher if there is a great deal of range, or
spread, in the scores of the group on which it is based than it does if the
members of the group score closer together. Thus, the user, in evaluating the
reliability coefficient of a test for a certain purpose, should insure that
the reliability coefficient is based on a group with about the same amount of
diversity as the one he is working with. The Army General Classification Test,
for example, is reliable enough to distinguish levels of general mental ability
in recruits, representing as they do an extremely wide variety of natural
abilities and attainments. But it is not reliable enough to distinguish
between one college student and another. A person who wished to use this
test in a university counseling program would be misled if he judged its
reliability by a coefficient based on a group of run-of-the-mill recruits. What
he needs to look at is a reliability coefficient based on information
pertaining to a group of college students.

Fortunately, there is another way of reporting how accurate or inaccurate
scores on a test are, the standard error of measurement. We have explained
the general meaning of the term standard error in Chapter 2. This particular
version of that statistic gives an estimate of the range of variation in a
person's score if he were to take the same test over and over again an infinite
number of times; if, in other words, we could draw many “samples” of
test scores in his whole “population” of scores. This range is a zone of
inaccuracy on either side of an obtained score. The logical and statistical
foundations on which this method rests are too complex to be set forth here,
but the final interpretation is relatively simple to grasp.

If the standard error of measurement for an intelligence test, for example,
is 4.6, there is a probability of about two-thirds that an individual's true score
is within 4.6 points of the score he has actually obtained. If Greg gets a score
of 154, we can conclude with some assurance that his “true score,” free from
chance influences, probably lies somewhere between 149.4 and 158.6. This
follows from the fact that in any normal distribution, about two-thirds of the
cases fall within one standard deviation of the mean. As explained previously,
the standard error is a special kind of standard deviation. There is still one
chance in three that his “true score” is higher or lower than this, and we can,
if we wish, using reasoning based on the concept of standard error, say that
there is only about a 5 per cent probability that it lies outside the range
from about 145 to about 163. (This is because, in a normal distribution, about
95 per cent of the cases fall within two standard deviations of the mean.) But
ordinarily we do not try to make such exact determinations. We use the standard
error as a guide to the general amount of inaccuracy we should allow for in
interpreting scores and making decisions about individuals. If the intelligence
test in the foregoing example is used to select students for an honors program,
and the decision has been made that all students who score 155 or higher are
eligible, Greg with his score of 154 does not at first glance seem to qualify.
But a teacher who takes unreliability into consideration will recognize that
Greg may belong in the select group, for the probability is fairly high that
his true score is at least 155. In such a case the teacher would look very
carefully at various supplementary evidence to find out what kind of student
Greg is before deciding whether or not to assign him to the special class.
Understanding what reliability means can thus make a real difference in the use
of tests in making decisions. (In stressing the fact that tests are not
altogether reliable, we must always remember that other ways of evaluating
people are unreliable too. A test-maker tells us how much inaccuracy to allow
for. Usually we have no way of knowing how much we should allow for in
teachers’ judgments, for example.)
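
The arithmetic behind Greg's zone of inaccuracy can be written out as a short Python sketch. The function names `score_band` and `band_includes` are hypothetical helpers introduced only for illustration; the figures (a score of 154, a standard error of 4.6, a cutoff of 155) come from the example above.

```python
def score_band(obtained, sem, n_errors=1.0):
    """Return (low, high): the obtained score plus or minus n_errors standard errors."""
    half_width = n_errors * sem
    return obtained - half_width, obtained + half_width


def band_includes(band, cutoff):
    """True if the cutoff score falls inside the band (ends included)."""
    low, high = band
    return low <= cutoff <= high


greg_score = 154
standard_error = 4.6

# About two-thirds of the time the true score lies within one standard error.
one_error_band = score_band(greg_score, standard_error)       # about 149.4 to 158.6
# About 95 per cent of the time it lies within two standard errors.
two_error_band = score_band(greg_score, standard_error, 2.0)  # about 144.8 to 163.2

print("one standard error: %.1f to %.1f" % one_error_band)
print("two standard errors: %.1f to %.1f" % two_error_band)
print("honors cutoff of 155 inside the narrower band:",
      band_includes(one_error_band, 155))
```

Because the cutoff sits well inside even the narrower band, the score alone cannot settle whether Greg belongs in the honors group, which is why the teacher turns to supplementary evidence.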

A person who does not understand what test “reliability” means may let
his general evaluation of a test be influenced too strongly by statements about
its reliability. For the word “reliable,” as used in our common speech, has
favorable connotations: a reliable (“good”) man, a reliable (“upstanding”)
firm, a reliable product. Teachers, personnel workers, and others are
likely to conclude that a reliable test is, ipso facto, a
good test for the purpose they have in mind. Such errors can lead to serious
misjudgments of testees. The knowledge that a test is accurately measuring
something tells us very little. We need to know much more about the test. Until
we know what the something it measures is, we are not justified in drawing
conclusions about people from the scores they make. As we saw in the
previous section, the process of finding out just what a test does measure is
long and arduous. “Reliable” does not mean “good” for everything. Validity
remains more important than reliability.


NORMS AND DERIVED SCORES 


In giving tests and interpreting their scores, we must consider, in
addition to reliability and validity, the various units in which those scores
are expressed. A score that indicates only how many questions a person answered
correctly does not tell us very much about the person unless we have some
standard of comparison. If Leonard brings home a report that his score on a
school test was 47, the first question his mother is likely to ask
is, “What kind of scores did the other children get?” The derived, or
transformed, scores used with standardized tests of aptitudes, achievement, and
personality are designed to facilitate just such comparison of individual
scores with group norms. We can go about the job of deriving, or transforming,
scores in several ways.

One way of accomplishing this is to set up percentile norms for a
group and convert each person's score into an equivalent percentile rank.
To make up tables of percentile norms, simply divide by the total number of
persons in the group the number of those who rank below each raw score plus
one-half of those who receive exactly the score in question. If, in the
intelligence test example on page 34, 91 out of a class of 100 high school
sophomores score below 154, Greg's score of 154 gives him a percentile rank
of 91 in this group.
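
The rule for building percentile norms (those below the score, plus half of those tied with it, divided by the size of the group) can be sketched as follows. The class scores are invented for illustration, and `percentile_rank` is a hypothetical function name, not a standard library routine.

```python
def percentile_rank(score, group_scores):
    """Percentile rank of score within group_scores.

    Counts everyone who scored below, adds half of those who got exactly
    this score, and divides by the total number of persons in the group.
    """
    if not group_scores:
        raise ValueError("the norm group must contain at least one score")
    below = sum(1 for s in group_scores if s < score)
    tied = sum(1 for s in group_scores if s == score)
    return 100.0 * (below + 0.5 * tied) / len(group_scores)


# Hypothetical class results on the school test on which Leonard scored 47.
class_scores = [40, 45, 47, 47, 50, 52, 55, 60, 62, 70]

leonard_rank = percentile_rank(47, class_scores)
print(f"Leonard's percentile rank in this class: {leonard_rank:.0f}")
```

In this made-up class two pupils score below 47 and two score exactly 47, so the rule gives (2 + 1) / 10, a percentile rank of 30.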
Another common method of translating scores into equivalents that
indicate where an individual stands in a group uses the mean and
standard deviation as a basis for norms. In normal distributions there is a
definite relationship between distance from the mean along the base line and
the shape of the curve. Figure 6 shows what this relationship is. Regardless of
how large or small the standard deviation is in any particular group of scores,
in a normal distribution, if we measure off one standard-deviation unit from
the mean, we find ourselves directly under the point where the curve changes
from convex to concave. Three such units measured off from the mean, and we
reach a point very near the end of the distribution. Thus, if we wish, we can
state in standard-deviation units how far each score is from the mean and
indicate in this way where each one fits into the total distribution. Instead
of reporting a score as 25 for someone in the group represented in Figure 6,
we can report it as −1.0 standard-deviation unit (or −1σ, in the Greek-letter
notation often used). As soon as he sees the latter, anyone familiar with the
normal distribution curve knows just about how far below average the subject
is.

Figure 6. The standard scores corresponding to raw scores at different levels
for a distribution in which the mean is 31 and the standard deviation 6.
This kind of transformation has enabled psychologists to make comparisons
that would not be possible with the raw scores themselves. In another
normal distribution of test scores (see Figure 7), the mean may be 178 and
the standard deviation 34. The numbers along the base line for raw scores
are different from those shown in Figure 6, but the standard scores are the
same. A score of 144 on this second test (178 minus 34) is just as low as a
score of 25 on the first one. Both are reported as −1.0 standard deviation
(or −1σ).

Figure 7. The standard scores corresponding to raw scores at different levels
for a distribution in which the mean is 178 and the standard deviation 34.

We often express basic standard scores in many other units. For instance,
to provide measurements divided into finer units, we can multiply all
numbers by 10 so that they range from −30 to +30 instead of from −3 to +3.
Then, in order to eliminate negative scores, we can add 50 so that they
range from 20 to 80, with 50 instead of zero as the mean. None of these
transformations, though, changes in any way the basic relationships between
scores. In Figure 8, we see how a number of derived scores as well as the
percentile scale are related to the basic normal distribution. Values in
vertical alignment on the various horizontal lines beneath the normal curve
will be equivalent if the norm groups on which they are based are similar.

Figure 8. Relationships between various types of derived scores. (Seashore,
H. G. Methods of expressing test scores. Psychological Corporation Test
Service Bulletin, No. 48, 1955.)
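
The two transformations just described, from raw score to standard-deviation units and from those units to a scale centered on 50, amount to two lines of arithmetic. The sketch below assumes the Figure 6 distribution has a mean of 31 and a standard deviation of 6 (consistent with a raw score of 25 lying one standard deviation below the mean) and uses the Figure 7 values of 178 and 34; the function names are hypothetical.

```python
def standard_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard-deviation units."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (raw - mean) / sd


def rescaled_score(z, new_mean=50.0, new_sd=10.0):
    """Multiply a standard score by 10 and add 50, as described in the text."""
    return new_mean + new_sd * z


# Raw score 25 on the first test (assumed mean 31, standard deviation 6).
z_first = standard_score(25, mean=31, sd=6)
# Raw score 144 on the second test (mean 178, standard deviation 34).
z_second = standard_score(144, mean=178, sd=34)

print(z_first, z_second)         # both scores sit one unit below the mean
print(rescaled_score(z_first))   # the same position on the 20-to-80 scale
```

The two raw scores look nothing alike, yet both convert to the same standard score, which is what makes comparison across tests possible.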

In addition to percentile ranks and standard scores, there are various other 
ways of making scores meaningful. We will consider the IQ in the next chap- 
ter. Age and grade norms for school achievement tests require no special 
explanation. 


CLASSIFICATION OF TESTS


In order to organize the whole complex and confusing field of psychological 
testing, we must somehow classify tests and isolate the common characteristics 
of each group rather than concentrate on the special qualities of particular 
tests. There are many possible ways to do this. For one, we might use as a
basis for classification the form of the test, that is, the way questions are 
presented. The principal distinction here is between individual and group 
tests. An individual test is, in effect, an interview in which the examiner asks 
questions and writes down the testee's answers. A group test is set up so that 
instructions can be given to many testees at the same time, each of them 
writing his own answers. Another distinction, which often cuts across this 
one, is between paper-and-pencil and performance tests. In the first the testee 
thinks about the problems presented and writes down the results. In the 
second he manipulates the test materials themselves, such as toys, picture
cards, or blocks. 

A third basis for classifying tests is content. The principal distinction 
here is between language and nonlanguage tests (or verbal and nonverbal 
tests). Note that this is not identical with the previous distinction. Paper- 
and-pencil tests can fall in either the language or nonlanguage category 
and so can performance tests. Written nonlanguage tests often use pictures 
rather than words or numbers; even the directions can be given in pantomime.
Within each of these main content categories, further distinctions based on
content are possible—verbal versus numerical or pictorial versus geometrical, 
for example. 

Still another distinction that may be of considerable importance in some
situations pertains to speed and power. In a speed test, a person's score
depends on how many questions he is able to answer in the time allowed. In
a power test, his score depends on the difficulty of the questions he can
answer. Investigations have shown that many tests have both speed and power
components, and that many, perhaps most, individuals score at about the same
level in both categories. But there are some situations and some individual
cases where this is not true. For example, a rehabilitation counselor working
with an arthritic patient may seriously underestimate his client’s abilities if
he gives him tests in which being able to write swiftly or move pieces around
rapidly helps to determine the score. He needs tests in which crippled hands
are not a handicap.

These criteria are all of considerable value, but perhaps the fundamental 
basis for classifying tests, and the one we shall use in this book, is what they 
are designed to measure. By this standard, we can distinguish three major 
classes of test, each with many subclasses. First of all, are the tests of gen- 
eral intelligence; second, the tests of special abilities; and third, the tests of 
personality qualities. Somewhere under these three headings most of the
thousands of existing psychological tests find their places, and within each of the
three groups we can discover the other distinctions we have outlined. Thus, 
of general intelligence tests, some are for individuals and some for groups; 
they vary greatly in type of content; and some are mainly speed tests, some 
mainly power tests. The distinctions in form and content are found in special- 
ability and personality tests as well; the category of power tests, however, is 
less applicable in the personality than the special-ability group. 


We have considered in this chapter certain concepts and procedures that
are applicable to all these varieties of test. But there are other concepts and
procedures that apply to one of the three areas (intelligence, special
abilities, and personality) more than to the others. We shall take these up
in the chapters to follow.


Intelligence Tests

The attempt to measure human intelligence has entailed more continuous,
long-term, intensive effort than any other project in psychological
measurement. In the study of intelligence, philosophical curiosity interacts
with practical demands. For centuries men have puzzled over the enormous
differences in sheer intellectual capacity that separate a Socrates from an
ordinary citizen, an idiot from a normal child. They have asked whether such
differences are innate or acquired, whether biological or educational factors
are more influential in producing them. Eventually, because of the practical
demands, it became more and more urgent to find ways of measuring the
intelligence of individuals as accurately as possible. Children of all kinds,
talented and untalented, came into the schools. Some had difficulty mastering
the curriculum designed for them. Others raced through it and busied
themselves with scientific experiments and philosophical speculations long
before they reached physical maturity. Clearly, teachers needed to be able to
distinguish between different mental capacities in order to educate children
suitably. Similar problems arose in military organizations and in business in
connection with attempts to fit individuals into appropriate positions.

When psychologists began their effort to measure intelligence, just before
the turn of the twentieth century, they lacked a precise idea of the nature of
the quality. As we saw in the discussion of validity in the previous chapter,
this situation is not unusual, nor is it necessarily detrimental. Even a vague
idea about a human trait can serve to guide a testmaker as he selects items
and plans validity studies. Then, as evidence accumulates about what scores
are related to, he is able to formulate increasingly clearer ideas about what
the test measures and to modify and improve the test.

The first intelligence test for children, the Binet-Simon scale, had been
preceded by a long period during which Binet had carefully observed
differences in the thinking processes of children. From this work certain
clues about what an intelligence test should be became apparent. First of all,
he decided, it must reflect the rapid growth in mental capacity that occurs
during childhood. Second, for practical reasons if for no other, it should tap
the characteristics by which teachers differentiate between “bright” and
“dull” children. Binet studied his own little daughters, talked to teachers
about how they judged mental capacity, and carried out various experiments
before he was ready to present an intelligence test to the public. His
tentative definition of intelligence as measured by this first scale was “the
tendency to take and maintain a definite direction; the capacity to make
adaptations for the purpose of attaining a desired end; and the power of
auto-criticism.” As we have seen, the publication of an intelligence test at
this particular time was stimulated by a request from the Paris school
authorities for a method that would help differentiate between really dull
children and others. The idea caught on immediately.

The Binet-Simon Scale itself went through two revisions, in 1908 and in
1911, and it was translated and adapted in many countries. In the United
States Lewis M. Terman of Stanford University took up the idea and started
working on an American revision. The result was the Stanford-Binet test,
first published in 1916, revised in 1937, and brought up to date in 1960. For
half a century the Stanford-Binet test constituted a virtually official
standard for intelligence measurement, like the bars and weights in Washington
that officially define for us our foot and pound. At present, however, more
alternative tests are available than in previous years, and the tendency has
diminished to refer all of intelligence measurement to the Stanford-Binet.
But the part this test has played in the development of theoretical concepts
and practical applications of mental testing can hardly be overemphasized.


DISTINGUISHING FEATURES OF BINET TESTS 


It is not important to know how the successive revisions differed from one 
another or from the original Binet-Simon test. What is worth looking at is the 
body of features that characterize all versions. First, they are scales. This
means that the constituent questions and tasks are grouped on the basis of 
their difficulty. Beginning with easy questions, the examiner asks harder and 
harder ones as the test proceeds. A child’s score chiefly depends on how far 
up this ladder he can go, rather than on how fast or how fluent he is (though 
both speed and fluency are required in some of the individual tests). From 
Binet’s 1908 revision on, the tests have been grouped by age levels. A great 
deal of effort has been expended on the task of finding out just what the 
age norms are. In order to assign a test to a particular age level, it is necessary 
to investigate how children of various ages handle it. The tests pegged at year 
ten, for example, must contain tasks that are in the scope of only a minority of 
nine-year-olds, a majority of ten-year-olds, and a much larger number of 
eleven-year-olds. Thus, an intelligence scale of the Binet type is built to meas- 
ure mental growth. 

The second shared feature of the Binet tests is that they yield a general
global measure of intelligence rather than an analysis of separate special
abilities. Third, each is an individual test given by a trained examiner (see
Figure 10). It requires more skill than one might suppose to maintain a
friendly, encouraging attitude toward the testee, while simultaneously
carrying out instructions to the letter. And these instructions are rigid. For
example, the examiner must never change the wording of a question in any
way; what might seem to him a minor modification may have the effect of
making an item harder or easier than it was for the subjects on whom the age
norms rest. The examiner may praise the testee in order to stimulate his best
efforts, but he must not do so too often or at the wrong times. He may repeat
some questions but not others. If it is not clear whether an answer meets the
specifications for a given age level, he may ask a question, but he must be sure
not to give the testee a clue to the right answer. To act natural, friendly, and
relaxed while using procedures and words that are so rigidly specified
requires a great deal of practice. Obviously, a Binet test cannot be given by
just anyone who comes into possession of the materials.

Figure 9. Materials used in testing intelligence with the Stanford-Binet.
(By permission, Houghton Mifflin.)

Figure 10. The administration of the Stanford-Binet test. (City College of
New York Educational Clinic.)

Finally, the system of scoring in all Binet tests is tied to the age norms.
A child's mental age, usually abbreviated MA, indicates the age group for 
which his performance would be typical. For example, an MA of 9-6 means 
that a child is as far along in his general mental development as the average 
nine-and-a-half year old. Thus, an MA is somewhat like the size of a boy's 
suit or a girl's dress. We tell people how big Susie is physically when we say
that she wears a size 10. We tell them how “big” she is mentally when we say
that her MA is 10.


THE MEANING OF THE IQ


The Stanford-Binet and other Binet-type tests also provide for the
computation of an intelligence quotient (IQ). Until the 1960 revision, when
statistical refinements were introduced, the IQ was figured simply by dividing
the child’s mental age by his chronological age, then multiplying the quotient
by 100 to get rid of the decimal point. For example, a child whose MA is 10,
but whose actual chronological age (CA) is 9 would have an IQ of 111
(10/9 × 100). The IQ indicates just how “big for his age” or “small for his
age” a child is. It tells us something about his rate of mental growth up to
the time we tested him.
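
The old ratio computation is simple enough to state as a short Python sketch. The function names `age_in_years` and `ratio_iq` are hypothetical, and rounding to the nearest whole number is an assumption made here to match the value of 111 given in the example.

```python
def age_in_years(years, months=0):
    """Convert an age written as years-months (for example 9-6) to years."""
    return years + months / 12.0


def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: mental age divided by chronological age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return round(100 * mental_age / chronological_age)


# The example from the text: MA of 10 at a chronological age of 9.
print(ratio_iq(10, 9))

# A child whose MA of 9-6 equals his chronological age of 9-6 is exactly average.
print(ratio_iq(age_in_years(9, 6), age_in_years(9, 6)))
```

A quotient above 100 means the child is “big for his age” mentally, and a quotient below 100 means he is “small for his age.”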


Unfortunately, numerous unwarranted ... adult intelligence. But as more
knowledge accumulated, it became apparent that the “constancy of the IQ” is
far from absolute.

For one thing, even in tests that are standardized and allocated to age
levels as carefully as they were in the 1937 revision of the Stanford-Binet
scale, IQ’s at different age levels are not entirely comparable statistically.
A very bright child could obtain a higher IQ at twelve than he did at six, even
if his growth rate had not changed at all, simply because the variability of the
IQ distribution was greater for twelve-year-olds than for six-year-olds. For
other published tests less carefully standardized, this variation in the
meaning of high and low IQ's from age to age is even more troublesome.

Another inadequacy is that the IQ is not an appropriate way to describe adult
intelligence. Like physical growth, mental growth in adults lacks the
predictable regularity it shows in children. From the mid-teens on, age
standards are relatively meaningless. It makes no sense to say that
20-year-old Jim's mental age is 25, because 25-year-olds and 20-year-olds do
not differ in their response to the material in intelligence tests. What looks
like an IQ for an adult is really a standard score. It signifies that the
person occupies the same position in the adult distribution as a child with
that IQ would occupy in a similar distribution of children. As psychologists
studied the complexities of IQ scores, they came to realize that all IQ's, for
children as well as for adults, could be interpreted as standard scores. What
an individual IQ really tells us is how many standard deviations above or
below average a person is.

One of the most useful things a student can learn is not to pay too much
attention to a numerical IQ. Many first-rate standardized tests now provide
norm tables from which other derived scores may be obtained, scores that
show directly where a person stands in a group with which he wishes to
compare himself, such as, for example, seven-year-olds, high school graduates,
or army recruits. The 1960 revision of the Stanford-Binet test, while it
continues to employ IQ terminology, no longer provides for the computation of
this score in the old way, that is, by dividing MA by CA. Instead, the
examiner uses tables that show directly how different a child’s score is from
the mean, or average, of a representative group of children his own age. This
derived score is called an IQ but is really a standard score. The IQ concept
served us well in the early days of intelligence testing, but it is now being
retired from active service.


THE WECHSLER TESTS 


Besides Lewis M. Terman, who devised the original Stanford-Binet test,
and Maud Merrill, who worked with him on the 1937 revision and prepared
the 1960 revision, another American has been particularly important in
intelligence measurement: David Wechsler. In 1939, Wechsler published a
standardized set of individual intelligence tests designed especially for adults. It was
called the Wechsler-Bellevue Scale (the second part of the name honored New 
York’s Bellevue Hospital where Wechsler worked). This test immediately came 
into wide use because of the demand, with the unprecedented growth of clinical 
psychology during and after World War II, for the evaluation of intellectual 
ability in millions of adults. The Binet scale in its various revisions had met 
the needs of institutions serving children, but had never been entirely satis- 
factory with adults. 

Since World War II, Wechsler has developed a scale for children built 
on the same plan as the original Wechsler-Bellevue Scale. He has also re- 
vised the adult scale thoroughly. The two current versions of Wechsler’s tests, 
then, are the WAIS (Wechsler Adult Intelligence Scale), published in 1955, 
and the WISC (Wechsler Intelligence Scale for Children), published in 1949.
Although the two overlap to some degree, the WAIS is principally de- 
signed for ages 16 and above, the WISC for ages 15 and below. 

Wechsler includes many of the same kinds of questions and tasks that 
Binet, Terman, and others had used, but he combines them differently. In- 
stead of grouping them by age levels, he assembles them by type of question 
or task, arranging the specific items within each set according to difficulty. 
For example, all the arithmetic questions are in one subtest, all the block- 
design tasks in another. The subtests in turn are grouped into two main classes 
labeled Verbal and Performance. The verbal tests included in the WAIS are 
headed Information, Comprehension, Digit Span, Similarities, Arithmetic, 
and Vocabulary. The performance tests are titled Picture Arrangement, Pic- 
ture Completion, Block Design, Object Assembly, and Digit Symbol. The 
subtests in the WISC are similar with a few minor changes. Verbal tests in- 
clude questions like these: 

What is the population of the United States? 

Why should we keep away from bad company? 


How are an orange and a banana alike? 
What do we mean by “fabric”? 


Some of the performance tests are shown in Figure 11.

Figure 11. Samples of material used in testing intelligence with the WAIS.
(The Psychological Corporation.)

Wechsler has set up separate [...] performance score, and a total score.
Finally, [...] against norm tables for the subject's [...] performance, and
total “IQ's.” Such an IQ, however, like that obtained from the 1960
Terman-Merrill revision, is not an actual quotient of MA and CA. It is,
instead, an indication in standard-deviation units of how the person relates
to the average of his own age distribution. See Figure 8, page 37, for the
relationship between Wechsler IQ's and the basic standard-deviation units in
many other systems of derived scores.

It was hoped when these scales first became available that differences
among scores on the separate subtests would facilitate diagnostic evaluations
of mental functioning that would be helpful to clinicians and teachers
interested in knowing more about an individual than an over-all score
revealed. It seemed possible that an analysis of strengths and weaknesses
shown on the various tests could help to answer such specific questions as: In
which psychiatric classification does Mr. King belong? Has Mrs. Logan incurred
some kind of brain damage? What particular educational deficiencies are
handicapping Susan? As time has passed, though, statements about the
diagnostic significance of score patterns, or profiles, have become more and
more guarded. We know now that there are many possible reasons why Mr.
Henderson may, for instance, score lower on Arithmetic than on the other
tests. Such a discrepancy may indicate anxiety, but it may simply mean that
the subject never really learned elementary arithmetic in the first place.
Clinicians still analyze an individual’s score profile and scrutinize his
answers to specific questions, seeking clues to help them understand him. But
they have learned to regard the ideas generated this way as mere hypotheses to
be checked against other information about the person.

Differences between verbal and performance scores are more meaningful
than discrepancies between the scores on particular subtests. True, we must be
cautious in interpreting these differences as well. Neither scale is so
reliable or accurate that small disparities should be taken seriously. But if
a person turns out to be 15 points or more higher on one scale than on the
other (15 points is the standard deviation of the IQ distribution for the
Wechsler tests), it may be useful to try to account for this discrepancy and
to consider what it may mean in planning the person's future, whether this
planning has to do with clinical treatment, educational undertakings, or
occupational placement. It is clear, in the first place, that the verbal scale
is more closely related to schooling than the performance scale is. A
combination of low verbal score and much higher performance score often
reflects some educational deficiency. The verbal score also predicts far more
accurately than the performance score how successful a person is likely to be
in future school situations. Besides this difference in the influence of
formal schooling, many other factors may be responsible for inequalities
between the two halves of the test. When an examiner encounters such a
discrepancy, he looks for evidence (in the examinee's record, in what he says,
or in what others say about him) that will perhaps account for the discrepancy
and will help those who must deal with the person to understand him better. In
his report on the person's test results, the examiner regularly includes not
just the scores but some qualitative analysis of what the scores may mean.
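
The 15-point rule of thumb described above can be written as a small check.
This is only a sketch: the function name and the sample scores are
hypothetical, and the threshold simply restates the standard deviation given
in the text. A flagged gap is a prompt for further inquiry, not a diagnosis.

```python
WECHSLER_SD = 15  # standard deviation of the Wechsler IQ distribution, per the text


def notable_discrepancy(verbal_iq, performance_iq, threshold=WECHSLER_SD):
    """Return the verbal-minus-performance gap and whether it reaches the threshold."""
    gap = verbal_iq - performance_iq
    return gap, abs(gap) >= threshold


# Hypothetical examinees.
print(notable_discrepancy(92, 110))   # (-18, True)
print(notable_discrepancy(104, 99))   # (5, False)
```

A negative gap with the flag set corresponds to the pattern the text singles
out: a low verbal score combined with a much higher performance score.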


In general, however, the most valuable information obtained from either the 
Stanford-Binet or the Wechsler test i 


GROUP TESTS OF INTELLIGENCE


We have gone into some detail about the two leading individual tests of
intelligence because the basic concepts and issues are more clearly defined
for them than they are for group tests. But as for frequency of use, group
tests rank far ahead of individual tests, millions of them being given each
year in schools, in industries, and in military organizations. Thus it is
essential that the educated citizen know enough about them to interpret their
results intelligently.

Although group intelligence tests are similar in many ways to individual
tests, there are some critical differences. Each particular group test is more
limited in purpose than is an individual test like the Binet or Wechsler. It
may be intended for a single age or grade group, for instance, and be
inappropriate for children much younger or older. [...] age levels are
published as a series. There are three Henmon-Nelson Tests, for example, one
for Grades 3-6, one for Grades 6-9, and one for Grades 9-12. Group tests
differ in types of questions, too. One designed especially for selecting
promising college students, such as the Scholastic Aptitude Test of the
College Entrance Examination Board, is vastly different in content from a
group test designed for sorting out army recruits into various training
programs, such as the Army General Classification Test.

Such specificity is an advantage in particular situations. As a rule, an
appropriate group test predicts specific criterion scores more successfully
than individual tests do. If we wish to know ahead of time the likely
first-year grade averages of applicants to a certain college, it would be of
no advantage to give each person an individual WAIS test. Scores on the verbal
half of this test do predict college grades fairly well, but scores on any one
of the special college aptitude tests predict them even better, and the latter
can be administered with a fraction of the time and expense involved in
individual testing.

What advantages, then, do individual tests have over group tests? First
of all, they allow the examiner to get a better idea of how highly motivated
the testee is, and to encourage him to put out his greatest efforts. For any
one person there is always the possibility that a score on a group test
constitutes an extreme underestimate of his ability, if for some reason he was
not trying to answer the questions correctly. Secondly, an individual test
gives a sounder indication of the mental capacity of a person whose reading
skills are not as well developed as his ability to think and reason. Although
there are some nonverbal group tests, those most commonly used do penalize
nonreaders.

[...] another they all enable us to compare the general mental ability of an
individual with the average of some group to which he belongs or in which he
must compete.


INFANT TESTS


[...] the others, does include a group of questions for two-year-olds and has
been widely used in preschool testing, but it is not suitable for children
younger than two, or even for dull two-year-olds. The question inevitably
arises: Is [...] a child’s status at the time he is tested. They are useful
[...] and others who deal with very young children and who need [to base]
their treatments and recommendations on an infant’s [developmental] age rather
than his chronological age. At any age level some [...] more advanced than
others.

[...] however, that such infant tests do not predict how [...] later
intellectual development will be. The child who is slow [in sitting] up alone
and reaching for objects on the table before him may be the first in his class
to learn to read or to master the intricacies of long division. Except for
cases of extreme mental retardation, it is not possible to estimate what a
child's later IQ is going to be by giving him any sort of test during his
first year.


LIMITATIONS AND M! 


Psychologists have been testing intelligence for many decades now. As
we have seen in connection with validity, research and practice with a certain
kind of test inevitably lead to changes in our thinking about traits as well
as tests. It is not surprising, therefore, that the very definition of
intelligence, at least the “intelligence” it has proved feasible to measure,
has undergone continuous revision since Binet first formulated his statement
about what he
was attempting to test. It would be well for all those who are using intelligence 
tests for one purpose or another to realize that what is measured by such tests 
is probably a more limited human trait than they are assuming. 

For there are many things intelligence tests do not measure. They cannot 
give us an over-all index of human quality, as some discussions of “giftedness” 
in children erroneously suggest. They do not get at special talents for art, 
music, mechanics, or human relationships. They do not tell us how success- 
fully individuals will adapt to new situations. 


Neither are they pure measures of innate capacity. There are undoubtedly
differences among individuals in their hereditary potentialities, but the
answers they give to the questions we ask in intelligence tests reflect
experience as well as potential, education as well as aptitude. To jump to the
conclusion, however, that tests are valueless because they do not measure
simply what is given at the beginning of life, and that any order of intellect
can be created at will through favorable educational influences, is to
disregard large amounts of contrary evidence. For, in fact, at any one time, a
score tells us what intellectual development has occurred as a result of some
complex mixture of heredity and environment and enables us to say with some
assurance what sort of further development to expect under specified
conditions. The younger the child is, the more cautious we must be about
making remote predictions.

Another misconception that can lead to wrong conclusions about individuals 
is the assumption that intelligence is a synonym for general learning ability. 
In the first place, it has been demonstrated rather conclusively that there is no 
such thing as general learning ability. What we find in people are special 
facilities in learning special things. Thus, one boy quickly learns to throw a 
baseball; another is inept at ball throwing but quick to develop skill in the 
use of carpenter’s tools. In the second place, scores on intelligence tests are not 
related to the rate at which even school learning takes place. The euphemism
“slow learner” is not an accurate way to characterize a dull child. Even
when two children have identical MA and IQ scores they may differ widely in
the time it takes them to learn how to spell a list of new words or to memorize
a multiplication table. Bright children get high marks and move ahead rapidly
in school not because they are faster at learning the same things everybody
else learns, but rather because at each age they are able to grasp more
complex, difficult material than their age mates can make sense of. Likewise,
the brilliant adult differs from his fellowmen in what he is able to think
about rather than in how quickly he can learn a new skill or solve a problem.
Einstein probably learned no more readily than his fellow citizens how to drive a
car or fill out his income tax returns, but he was able to deal with concepts 
whose complexity was far beyond their scope. 


TESTED INTELLIGENCE AS SYMBOLIC THINKING


What all intelligence tests measure is the ability to deal with symbols. The
more intelligent a person is, the more complex and abstract these symbols can
be. As they grow to maturity, children increase their capacity for such
symbolic thinking, and they also tend to become specialized in the kind of
symbols they can deal with most adequately. Thus, both level and pattern of
intellectual abilities become important considerations. The Wechsler tests,
with their separate scores for different kinds of questions, have shown us
that any one person is not equally proficient in all varieties of thinking. In
the next chapter we shall have more to say about the research that has been
done on such inequalities.

Thinking, in which symbols rather than objects themselves are manipulated,
is an essential component of civilized life. To be able to measure this
capacity, therefore, has been of inestimable value. But let us not assume that
it is all we need to evaluate in assessing the worth of a person and in helping
him to find his place in civilized society.


THE DEVELOPMENT OF INTELLIGENCE


[One] question has, down through the years, been of particular interest
[to those] studying intelligence: What is the nature of the process of
[mental] growth? This problem breaks down into three main subquestions. Is
[it regular] and predictable from early childhood on? (We have already noted
[that] infant test scores do not predict later intellectual level.) At what
age does growth cease? (We have already noted that the IQ becomes
inappropriate as a measure of intelligence when growth is complete.) Does
intelligence change in adult life as a person moves from youth through middle
age to senescence? Although there are still some doubtful aspects, research
has uncovered fairly satisfactory answers to these three questions. Let us run
through the present state of knowledge briefly.

When intelligence tests were new, many studies of large groups of children
were made in order to determine whether the IQ, which as we have seen
indicates the rate of growth characteristic of a child, remains constant from
age to age. The answer to the question turned out to be between two extremes,
the “constancy” and “anticonstancy” positions. Even after various statistical
wrinkles have been ironed out, it is clear that most children show some change
in IQ during their school years. Shifts up or down of as much as 15 IQ points
on the Stanford-Binet test occurred in over half of a group of 40 cases
studied over a long period at the University of California. Shifts of 30 or
more points occurred in 9 per cent of the cases, and an occasional change of
as much as 50 points showed up. Still, the over-all correlation between early
and later tests from the age of six on is about .80. Thus, we can conclude
that it is not likely that a child will appear to be brilliant at one age and
retarded at another, or even average at one age and superior at another. As in
the case of physical growth, there are some spurts and lapses, but a
considerable degree of regularity in the growth rate in most cases.
Psychologists and educators have concluded from this research that to be on
the safe side they should retest children every few years if they are going to
base decisions about such matters as school placement on test scores. An “old”
IQ on a child's record should not be used as an indication of his present rate
of growth.

The question about when mental growth comes to an end can be answered
only by an equivocal “It depends.” In the early years of mental testing, when
the IQ was the prevailing index of intelligence, test-constructors had to make
some decision about the highest chronological age they would use in the
denominator for computing IQ's for adolescents and adults. From preliminary
observations they decided on 16. But experience with these early scales, such
as the 1916 Stanford-Binet, suggested that this figure was too high, that some
adolescents were being rated lower in intellectual capacity than they deserved
because the denominator of the IQ fraction was too large. Many examiners
then shifted to 14, partly influenced by some widely publicized conclusions
from the testing of soldiers in World War I that the average intelligence
of American citizens was about at that level. But as work with intelligence
tests continued, it became apparent that there is no one age at which the
ability measured by such tests has ceased growing for everybody. If a test has
enough difficult questions to permit increases in ability to show (many of
the early tests did not), the average score for students in schools and
colleges continues to increase year by year up to the early 20’s. Groups of
persons not in school, however, do not typically show this increase. Now that
accumulated evidence has indicated that education influences what we measure
as intelligence, this discrepancy does not seem strange. It is clear, however,
that after the middle teens the yearly increases in actual capacity to think
in complex abstract terms are slight compared with the yearly increases that
occur at earlier stages of development. The accumulation of information and
the sharpening of the “mental tools” for processing such information can
continue throughout life, of course.

Finally, as to the third question, in the early years of the mental-testing
movement, comparisons of groups of young, middle-aged, and elderly adults
suggested that a decline in intellectual capacity begins almost immediately
after its peak is reached. But as longitudinal studies of the same persons at
different stages of life became available, this conclusion was modified. At
least through the 40’s, slight increases in scores on intelligence tests have
been found. Significantly, subjects in these longitudinal studies have been
well-educated persons, on the whole. It seems likely, then, that to maintain
or improve one's intellectual powers, one must continue intellectual activity.
This whole problem will be considered in more detail in Chapter 7.


With all their faults, [...] society. We use them to [...] tools they require
skillful handling and thorough knowledge of what they will and will not do.


Tests of Special Ability


APTITUDE AND ACHIEVEMENT


Even the most casual observation of people is enough to show us that
intelligence is not all that counts toward success. Bill Jenkins does
extremely well in college science courses even though he had only an average
rating on the intelligence test he took as an entering freshman. But he has
studied science and mathematics diligently since he was 12, thereby
accumulating a vast amount of specific knowledge and mastering the essential
problem-solving skills. Henry Harlow becomes an outstanding machinist because
he possesses to an unusual degree the ability to see how parts fit together.
Joe McGee shows an amazing sensitivity to fine differences in pitch and rhythm
and makes use of this talent in learning to play the violin. Lucille Rosen
knows far more about contemporary affairs and international relations than
anyone else in her high school class and thus becomes the natural candidate to
represent her school at an international conference. We could multiply
examples of special fitness endlessly.

Psychologists have proceeded along two different lines in developing tests 
to identify such special talents. These two undertakings, achievement testing
and aptitude testing, at first followed somewhat different courses, but eventu- 
ally the traffic on the two highways merged. It is still customary for writers on 
mental testing to distinguish between aptitude and achievement tests, but the 
distinction has become more a matter of convenience than of basic concepts. 

The terms aptitude and achievement still carry erroneous connotations 
arising from historical sources. In earlier periods aptitude usually meant 
special talents presumably based on innate, or hereditary, differences among
persons rather than on differences due to experience and learning. To say a 
child had a high degree of musical aptitude meant that he possessed the kind 
of ear and brain which would facilitate his learning of complex musical skills, 
not that he possessed some of these skills already. Intelligence, as measured 
by the tests considered in the previous chapter, was considered to be a special 
kind of “innate” aptitude for school work. An early book on aptitude testing 
by Johnson O’Connor was actually entitled Born That Way.

Achievement tests, on the other hand, were thought to measure what individuals
had learned. The search for more reliable and valid achievement tests
was stimulated by the demand for better and more convenient school examina- 
tions. For in many school situations it is important to obtain an accurate 
estimate of how much a given student really knows about algebra, English 
literature, or chemistry. Various special achievement tests were also con- 
structed for nonschool situations, such as trade tests to enable an employment 
interviewer to find out whether a person claiming to be a skilled worker really 
belongs in the “skilled” category. 

As has already been pointed out, psychologists no longer think intelligence
tests measure pure “innate” ability; rather, they measure an unanalyzable
mixture of inborn potential and educational experience. This conclusion holds
for other varieties of aptitude test. The ability measured by
mechanical-aptitude tests, for example, is partly an outgrowth of mechanical
experience. The ability measured by clerical-aptitude tests is partly an
outgrowth of whatever experiences have sharpened a person's perception of fine
details. The ability measured by musical-aptitude tests is partly a reflection
of musical training. In practice we cannot disentangle the “natural” from the
“acquired” components of aptitude, though we can think about them separately
if we like.

Furthermore, in practical studies designed to produce tests of aptitude for
particular kinds of work or training, an achievement test often turns out to be
[a good predictor] of later success in the field in question. Thus, tests
measuring what a student knows about subjects taught in high school (English,
mathematics, science, history, and so on) are used successfully to assess
aptitude for college work. A measure of flying information was a valuable part
of the battery used to select Air Force pilots. Spelling tests have been
useful [in selecting] clerical workers.

What then is the distinction between aptitude and [achievement tests]?
Besides the fact that the two varieties of tests have different histories, the
main distinction is a matter of purpose. Tests developed and used primarily to
select workers or trainees are labeled aptitude tests. Tests developed and
used [primarily] to find out how much students have learned are called
achievement [tests]. The more general term “ability” covers both.

[...] the tests we ordinarily classify as aptitude measures are those that
[resulted] from well-known major research programs. A large-scale
investigation of mechanical aptitude, at the University of Minnesota,
completed in 1930 by Donald G. Paterson and his coworkers, produced three
tests that have been in constant use since that time, the Minnesota Spatial
Relations Test, the Minnesota Mechanical Assembly Test, and the Minnesota
Paper Form Board. At about the same time, another extensive research
undertaking at the University of Iowa produced the Seashore Measures of
Musical Talents. Other examples could be given, but these two are sufficient
to illustrate why certain tests are usually put in the aptitude category today
for historical reasons. Other research studies contributed clerical-aptitude
tests, methods for identifying artistic talent, measures of dexterity and
motor coordination. All such tests have been mainly classed as aptitude rather
than achievement measures.

More useful at present, however, than this distinction based on research
history is the distinction based on purpose. If we plan to use a test chiefly
as a predictor of how well individuals will perform in some area, we can
consider it an aptitude test, regardless of how its author classified it. If
we plan to use it mainly to evaluate an individual’s accomplishments or the
adequacy of his education and experience, it is for our purposes an
achievement test, even if its author did not think of it in this way. The two
purposes are not exclusive of one another, of course. We may wish to evaluate
and to predict for the same person, and the same test may well be useful for
both purposes.


VALIDITY FOR DIFFERENT PURPOSES 


The advantage of this approach to special abilities lies in clarifying our
thinking about the evidence we must have to “prove” that a test is really
valid. What we are interested in is validity for specific purposes. We must
make sure that the evidence the author presents for the validity of his test
is really relevant to these ends. Just any kind of validity coefficient is not
enough. For example, if the author of a test of mechanical comprehension
reports that the scores of a group of ninth-grade boys in January correlate to
the extent of .65 with independent ratings of these same boys made by their
instructor during the same January, we still know nothing about the test’s
[...]

[...] [Washington], D. C.: American Psychological Association, Inc., 1945.)


In Figure 12 we see the outcome of a special predictive study of a battery
of tests used to select Air Force trainees during World War II. (The stanine
is a “standard nine” [...] in which 5 is average, one is low, and 9 is
[high]. [...] along with other derived scores in Figure 8, page 37.) [The
results turned] out to be clear and unambiguous, so that the subsequent use of
this set [of tests] to select pilots was plainly justified. The figures showed
[that] admitting only candidates with high stanines would reduce to a minimum
the [proportion] of men who washed out during primary training.
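
A stanine can be derived from a standard (z) score with a short calculation.
The sketch below assumes the common convention that the stanine scale has a
mean of 5 and a standard deviation of 2, with bands half a standard deviation
wide; the function name and sample values are hypothetical, and an actual
selection battery would assign stanines from its own norm tables.

```python
import math


def stanine(z):
    """Convert a z-score to a stanine (1 to 9), assuming mean 5 and SD 2."""
    raw = math.floor(2 * z + 5.5)   # half-SD-wide bands centered on stanine 5
    return max(1, min(9, raw))      # the two extreme bands are open-ended


for z in (-3.0, -1.0, 0.0, 1.0, 2.5):
    print(z, stanine(z))
```

Under these assumptions a candidate exactly at the group mean receives a
stanine of 5, and one a full standard deviation above it receives a 7.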

[If we] intend to use a test to measure achievement rather than aptitude,
[a different] kind of validity evidence becomes relevant. The main question
is [whether] the content of the test really samples the subject matter in
question. [We] can tell something about this by examining the questions. But
we also [need to] get information about how the items included in the test
were selected and who did the selecting. For example, a high school teacher
decides [to use] a standardized achievement test at the end of his course in
American [history]. So he obtains a sample of a test that appears suitable,
along with a [manual] of information about it. If he knows the authors of this
test, at least [by] reputation, and respects their judgment about the
objectives of an American history course, he starts out with a favorable
attitude. If the manual [states] that a committee of experts drew up the
preliminary outline and [served] as consultants in cases of doubtful items, he
is further impressed. If the authors tried out the first set of items on a
representative group of students to make sure that the meanings of the
questions were clear and that they were neither too hard nor too easy, another
point is chalked up in favor of the test.

To summarize, then, an ability test can be considered either an aptitude or
an achievement test, depending on the purpose for which we desire to use it.
If it is to be used as an aptitude test, predictive validity must be
demonstrated. If it is to be used as an achievement test, content validity is
important.

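
A predictive validity coefficient of the kind discussed above is a
correlation between test scores and a criterion measured at a later date. The
following sketch computes a Pearson correlation from first principles; the
function name and both lists of figures are hypothetical, chosen only to show
the calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: aptitude scores at admission, grade averages a year later.
aptitude_scores = [52, 61, 47, 70, 58, 66]
later_grades = [2.4, 3.0, 2.1, 3.6, 2.7, 3.1]
print(round(pearson_r(aptitude_scores, later_grades), 2))
```

What makes the coefficient predictive is the time interval: the criterion is
collected after the test, not during the same month.
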
RESEARCH ON SPECIAL ABILITIES 
Occupational Studies 
Let us turn now to a consideration of what research based on aptitude
tests has taught us about the nature of aptitudes, or special talents,
themselves. As indicated above, research projects have been set up to explore
such general areas as mechanical aptitude, clerical aptitude, motor skill,
artistic and musical talent. For a time psychologists held high hopes that it
would be possible to identify an appreciable number of special abilities and
discover how much of each ability was required for success in each of a
considerable number of occupations. To do so would enable them to do a really
“scientific” job of guiding each young person into the occupation for which he
was best suited. Unfortunately, as research on the process of occupational
choice has progressed, this hope has grown dimmer. Aptitudes, it seems, are
more complex, more dependent on special kinds of previous experience, than we
first thought they were. Many special talents have turned out not to be
measurable at all, at least by present procedures. (Efforts to measure social
intelligence, for example, have failed.) Then, too, interests and abilities
often differ widely, so that a person has no wish to enter an occupation in
line with his measured aptitudes. Finally, technological change has become so
rapid that we have no assurance that a test showing high predictive validity
for a particular occupation in 1965 will be related to anything people are
doing in 1975.



Factor Analysis 
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As interest in it 

; rci 

Standpoint of occupations has fallen off, resea A 
ased, namely, a concerted effort to analyze gene 


Special abilities from the 
of another kind has incre 


another. Accordingly, the i 
Order to test intelligence adequately, 
different mental abilities. A: 
to do this, but the abilities 


Figure 13 illustrates the kind o 
have given three tests to a largi 
outlined in a previous chapter, 


for Tests A and B, Tests A and C, and Tests B and C. It turns out that all 
are positive, but that the correlati 


than the other two. How might we account for this situ 


The overlapping areas in Figure 13 constitute one way of explaining this. It is the abilities represented by these areas, abilities that affect one's performance in more than one test, that the factor analyst tries to locate. The situation with which he actually works is a great deal more complex than that shown in Figure 13. He is more likely to start with fifty tests than with three, and consider the 1225 correlation coefficients he obtains from them, that is, (50 × 49) ÷ 2. After he finds out which ones have something in common, he gives their common “factor” a name based on his analysis of the reasoning, background experience, or special skill that seems to be involved in all the overlapping tests that determine it.
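
The arithmetic behind the 1225 coefficients can be sketched in a few lines of Python. This is only an illustration: the function names and the three short score lists for Tests A, B, and C are invented for the example and do not come from any real battery.

```python
from itertools import combinations
from math import sqrt


def pair_count(n_tests):
    """Number of distinct test pairs, n(n - 1) / 2."""
    return n_tests * (n_tests - 1) // 2


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_table(scores):
    """Correlate every pair of tests; scores maps a test name to its score list."""
    return {
        (a, b): pearson_r(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


# Hypothetical scores of five people on three tests (illustrative only).
scores = {
    "A": [10, 12, 14, 16, 18],
    "B": [11, 13, 13, 17, 19],
    "C": [5, 9, 6, 10, 8],
}
table = correlation_table(scores)
print(pair_count(3), pair_count(50))
for pair, r in table.items():
    print(pair, round(r, 2))
```

With three tests there are three coefficients to inspect; with fifty there are 1225, which is why the factor analyst needs a systematic method rather than inspection.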
Figure 13. Relationships between what is measured by different tests as postulated by theory underlying factor analysis.

In the early years of factor analysis, there was a good deal of controversy about the conclusions investigators were reaching. Charles E. Spearman held that the most satisfactory inference from analyses of correlations was that all tests measure to varying degrees one common factor—what he called “g.” Louis L. Thurstone in the 1930's, using more numerous and diverse tests and a more complex method of analysis, concluded that there are from six to ten primary mental abilities. He encountered the same pattern of abilities in studies of college students, high school students, elementary school students, and even kindergarten children. He used initials rather than whole words as labels: V (verbal), Q or N (quantitative or numerical), S (spatial), P (perceptual), M (memory), and R (reasoning). In the older groups, a verbal fluency factor he called W (words) seemed to be distinct from the verbal-meaning factor, V, and there was evidence, he felt, for more than one variety of reasoning.

Ever more complex factor-analytic studies made during and after World War II have shown that Thurstone's view, like that of Spearman, was an oversimplification. For there are clearly many more varieties of special ability than those listed above, and they are related to one another in highly complex but orderly ways. Both the General Aptitude Test Battery for employment service use, and the Air Force battery mentioned earlier are based partly on factor-analytic methods. J. P. Guilford has concluded that there are at least
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that can be differentiated from “ 
the test; by the product, or form i 
by the operation, or kind of menta 


, 


calls convergent and divergent thinking? 
ed in typical intelligence tests have required a subject 

4 measure only convergent thinking. (All the 
onverge upon this right answer if he is to 


Se ah e such 
] '5 most origina] thinking is stimulated by 
“many-answer” questions, Guilf 


ingenious techniques for meas 
Questions like “Think of as m 
and “Think of as many uses fo 


cal situation so that their relationship to real-lif, 
ability tests based solely on factor analysis of 
not tell a person what he needs to know about the way he is likely to perform in the criterion situation and thus may lead to unwise decisions.


DO DIFFERENT OCCUPATIONS 
REQUIRE DIFFERENT KINDS OF ABILITY?


Evidence has piled up that people in different occupations do differ from one another in their special abilities. The average scores for garage mechanics and clerical workers on several tests are shown in Figure 14. Clerical workers, it seems, are well above garage mechanics in educational ability (often called general intelligence), clerical ability as measured by a clerical test, and three kinds of dexterity. Garage mechanics, however, are higher in both tests of mechanical ability. On the basis of evidence like this it is often possible for a vocational counselor to tell a client whether he has the same pattern of abilities as other people in an occupation he is considering, even if he is not able to predict just how successful the person would be on the job.

A large-scale study by Robert L. Thorndike and Elizabeth P. Hagen reported in the book 10,000 Careers gives us some additional confirmation that occupational groups do indeed differ in patterns of abilities. It also indicates, however, that the degree of success a person will attain within an occupation cannot be predicted from his test scores. What Thorndike and Hagen did was first to locate in 1955 as many as possible of the men who had taken the Air Force battery of tests during [...] all, and next to find out what [...] assessed b[...] all in the near-zero range.

Figure 14. Occupational test profiles for clerical workers and garage mechanics. Test labels in the chart include Educational Ability (Pressey Senior Verification), Minnesota Clerical (Number checking), Dexterity (O'Connor Finger), and Composite scores; the horizontal scale runs from -50 to +50 around the average. (From: Differential Occupational Ability Patterns by B. J. Dvorak, Employment Stabilization Research Institute, Vol. 3, No. 8. University of Minnesota Press, Minneapolis. Copyright 1935 by the University of Minnesota.)

Figure 15. Occupational test profiles for [...]ants and [...]. Score labels in the chart include General intellectual, Numerical fluency, and Mechanical. (Thorndike, R. L., & Hagen, E. 10,000 careers. New York: Wiley, 1959, pp. 27-28.)

[...], we must remember that [...] Figures 14 and 15 are averages. Within each occupation individuals vary greatly. Some manage to meet the demands of a job with far less of a given ability than the average worker possesses, others have much more. Sin[...] to facilitate this sorting process, so that i[...] of time and talents, and with fewer cost[...] abilities essential to an occupation, to co[...]
to set minimum scores on each of these tests. Such thinking led to the development of the General Aptitude Test Battery (GATB) by the research staff of the United States Employment Service. In building this set of nine aptitude tests, they used both basic approaches—job analysis leading to tests with predictive validity and factor analysis leading to tests of basic abilities. The nine tests in this two-and-one-half-hour battery measure G (general intelligence), V (verbal ability), N (numerical ability), S (spatial ability), P (form perception), Q (clerical perception), K (motor coordination), F (finger dexterity), and M (manual dexterity). Minimum scores in the critical aptitudes for each particular kind of occupation have been set. Fortunately, the same pattern of essential aptitudes often characterizes a number of related occupations, so the task of interpretation is not unwieldy. With the GATB, an employment counselor can compare an individual's profile with each of 22 Occupational Ability Patterns that have been identified through research. These 22 basic patterns cover more than 500 jobs. Thus, a person who takes the GATB may be made aware of many fields he has never considered, but for which he might qualify. He will also find out if he has less than the minimum amount of some special ability that seems to be required in a type of work he has been considering. For a person thinking about his future, this is good return on the investment of about two-and-one-half hours of testing time. The GATB shows what can be done when all the knowledge we have accumulated over the years about special vocational aptitudes is brought to bear on one practical problem. Much validity evidence still needs to be collected, however, and the conclusions about a person must be cautious ones. Like all aptitude tests, these are far from infallible.
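
The comparison of one person's profile with a set of minimum-score patterns can be pictured with a short Python sketch. Everything specific here is assumed for illustration: the three pattern names, the cutoff values, and the sample profile are hypothetical and are not the published Occupational Ability Patterns.

```python
# The nine aptitude letters named in the text.
APTITUDES = ("G", "V", "N", "S", "P", "Q", "K", "F", "M")

# Hypothetical patterns: each lists minimum scores on its critical aptitudes only.
PATTERNS = {
    "pattern_a": {"G": 110, "V": 105, "N": 100},
    "pattern_b": {"S": 100, "P": 95, "M": 90},
    "pattern_c": {"Q": 105, "K": 100, "F": 95},
}


def shortfalls(profile, minimums):
    """Return the aptitudes on which the profile falls below the minimum, with the gap."""
    return {
        aptitude: minimum - profile[aptitude]
        for aptitude, minimum in minimums.items()
        if profile[aptitude] < minimum
    }


def matching_patterns(profile, patterns):
    """Names of the patterns whose every minimum score the profile meets."""
    return [name for name, minimums in patterns.items()
            if not shortfalls(profile, minimums)]


# Hypothetical profile for one person (illustrative values only).
profile = {"G": 112, "V": 108, "N": 101, "S": 98, "P": 97,
           "Q": 110, "K": 102, "F": 96, "M": 85}

print(matching_patterns(profile, PATTERNS))
print(shortfalls(profile, PATTERNS["pattern_b"]))
```

The first result corresponds to fields the person might qualify for; the second shows where he has less than the minimum amount of a required ability. As the text cautions, meeting the minimums says nothing certain about eventual success.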


SPECIAL CHARACTERISTICS AND USES OF ACHIEVEMENT TESTS


Let us look now at some of the facts and principles we need to keep in mind when we use special-ability tests designed to measure achievement—what a person has learned or accomplished.

Essentially, a standardized school achievement test is only a refinement of the examination a teacher gives at the end of a course. Such tests are designed to do accurately and on a large scale what every teacher does routinely when he tries to find out how much of the material of a course a student has mastered. It is no accident that achievement tests have become big business in the United States. Our system of universal public education, secondary as well as primary, brings teachers and school administrators into contact with students drawn from an extremely wide range of home and community backgrounds. Yet because our schools are locally controlled, we
[...] of such batteries of achievement tests as the [...] Tests of Educational Development (ITED), the Sequential Tests of Educ[...]

[...] right answer. Furthermore, this answer must not be too easy or obvious if the test is really to serve its purpose of distinguishing among students at many levels of knowledge and proficiency. The writing of good items is an art. [...] what to do and what not to do, but the skill, like other [...] skill, is not really analyzable.

Once the items have been written, however, and put together in a trial form, techniques are available to help the test-maker decide which should be included in the final versions. First he arranges to have several hundred persons of the age or class for which the test is intended take the trial form. Then he tabulates their responses to each question. From these tabulations he obtains indexes of difficulty and discrimination. As for the former, while it is usually desirable to have a mixture of difficult and easy items in a test, items which are so easy that almost everybody gets the right answer or so difficult that almost nobody gets it serve no useful purpose. In the final form they are omitted. A discrimination index for an item shows whether students who can be considered to be good at the kind of thinking the test is designed to measure because their over-all score is high are more successful with the item than poor students are. An item that does not discriminate in this way is dead weight in a test. It too is eliminated. It is not uncommon in the course of item analysis of a multiple-choice test to discover that on one particular question good students are actually more likely to choose one of the wrong responses than poor students are. Sometimes this happens because they employ a more complex and subtle reasoning than the author of the test anticipated. Naturally, such items are also eliminated before questions are selected for the final form of the test.
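
The two indexes can be made concrete with a small Python sketch. It rests on assumptions not stated in the text: difficulty is taken as the proportion answering correctly, discrimination as the difference between that proportion in the higher-scoring and lower-scoring halves of the trial group, and the retention thresholds (0.9, 0.1, 0.2) are arbitrary illustrative values. The response data are invented.

```python
def item_analysis(responses):
    """Return (difficulty, discrimination) for each item.

    responses: one row per student, each row a list of 1 (right) or 0 (wrong).
    Difficulty is the proportion of all students answering correctly.
    Discrimination is that proportion in the top-scoring half minus the
    proportion in the bottom-scoring half, halves defined by total score.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    order = sorted(range(n_students), key=lambda i: totals[i])
    half = n_students // 2
    low, high = order[:half], order[-half:]
    results = []
    for j in range(n_items):
        difficulty = sum(row[j] for row in responses) / n_students
        p_high = sum(responses[i][j] for i in high) / half
        p_low = sum(responses[i][j] for i in low) / half
        results.append((difficulty, p_high - p_low))
    return results


def keep_item(difficulty, discrimination, too_easy=0.9, too_hard=0.1, min_disc=0.2):
    """Illustrative retention rule; the threshold values are assumptions."""
    return too_hard < difficulty < too_easy and discrimination >= min_disc


# Hypothetical trial-form results: six students, five items.
responses = [
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
]
results = item_analysis(responses)
kept = [j for j, (diff, disc) in enumerate(results) if keep_item(diff, disc)]
print(kept)
```

In this invented data the first item is answered correctly by everyone and so tells us nothing, while the fourth is passed only by low scorers, the reversed pattern the paragraph above describes; both are dropped.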
Once a satisfactory set of items has been assembled, the rest of the steps in the construction of an achievement test are much like those discussed in Chapter 3. The test-maker must explore reliability, and express it as a reliability coefficient or standard error of measurement. He must specify standardized procedures for administering and scoring the test. He must work out a meaningful derived score based on a representative sample of the population for which the test is intended. And he must prepare norm tables. As explained previously, if sound decisions about what is to go into the test have been made all along the line, content validity is insured.

We have been considering achievement tests from the producer's viewpoint. But what does the consumer—teacher, school administrator, or the testee himself—need to keep in mind in evaluating these tests and the scores obtained from them? There are so many achievement tests available for measuring what people know about nearly all school subjects at all educational levels that a choice is often difficult. The primary consideration is the one stressed in the preceding paragraphs—content validity. In deciding which of the available batteries of achievement tests to use, for example, a school administrator examines the test and reads very carefully the section in the manual in which the authors explain how the basic outline was developed and how the questions were chosen. If he finds that this task has been done well, he can at least consider the test as a possibility. He will then go on to evaluate its other characteristics—how reliable it is, how convenient it is to administer and score, what the norm groups are like. Perhaps what needs to be stressed most is that to judge the content validity of a test we must do more than examine the test questions themselves. What seems to be a question with a simple factual answer may call for a complex reasoning process. What seems to be an adequate assortment of questions on different principles and concepts may be quite overbalanced in one direction. Content validity rests on the whole set of procedures used in planning and constructing the test. Face validity alone—what one can judge simply from looking at the questions—is not an adequate basis for choosing a test.


TERIES 


Because there is no clear line 
and because specia] 
in schools and coun 
for use in educati 
high schools is cal 
eight separate tes 
Abstract Reasonin 


miscellaneous collec- 
or all tests are based 
on the same representative group of people. Thus, the scores that show where an individual stands in relation to this norm population are comparable from test to test, and we can say with assurance, for example, that Hugh ranks much higher in mechanical reasoning than in verbal reasoning.


There are several others of this combination type, and more are likely to become available as time passes. As the search for innate abilities has been [...], another question has come into clear focus. What is useful to measure about a person in order to facilitate decisions—those that others must make about him and those he must make about his own life? In this vein, [...] to understand what kinds of special ability he possesses, regardless of how this ability was developed. The concept of special ability, which emphasizes what can be done with talents rather than where they came from, may gradually take the place of the aptitude and achievement concepts that played so important a part in the development of testing methods in the past.

Right now one research undertaking in progress should eventually throw a good deal of light on questions about how special abilities are used. It is called Project TALENT. In the spring of 1960, 440,000 students in 1353 high schools spent two days taking tests. The tests in the TALENT battery had been carefully designed and pretested in order that they might tap as many abilities as possible; 37 separate scores were obtained for each subject. The schools in which testing was done were carefully selected to constitute a complete cross-section of secondary schools in the United States. Besides the test data, many kinds of information were collected about students, schools, and communities. The plan is to follow up these testees in 1961, 1965, 1970, and 1980 to find out what has happened to them—the careers they chose, the education they obtained, their achievements and contributions to society. When the analysis of this formidable body of data is complete, it should tell us more than we have ever known before about what happens to people with different special talents under different circumstances. At each stage of the project, reports are to be published to keep the public informed about what has been learned so far.
The development of special-ability testing has coincided with an increasing awareness by Americans that society as a whole benefits when individuals are encouraged to use their talents effectively and accomplish as much as they can. In this very real sense, what is good for John Jones is good for the country. To the extent that they can help us achieve this joint individual-social purpose, such tests are worth the effort and ingenuity expended on them. Accordingly, they occupy an important position in the total structure of psychological measurement today.
Personality Tests

As we look around us at the people we know—and as we examine our own lives—it is apparent that some of the most distinctive personal qualities cannot be classified as abilities at all. We usually do not know or need to know the score our next-door neighbor makes on a clerical test, but we do find ourselves making some necessary estimate of his friendliness, neatness, cooperativeness, generosity, dependability, and attitudes about politics. Such qualities are essential to the business of living. For success depends on them fully as much as it does on competence. And social problems such as crime and mental illness involve faulty personal habits in individuals rather than [...]. Clearly, as children grow from infancy to adulthood, [...] is at least as important to develop as intellectual maturity. Thus, it is not strange that psychologists should have taken up the challenge to measure personality traits as well as abilities.


SPECIAL PROBLEMS IN PERSONALITY TESTING 


There is some ambiguity about the meaning of “personality test,” an ambiguity arising from differences of opinion about how “personality” should be defined. If we take the broad view and define personality as the total [...] that includes all of a person’s characteristics, the tests we have considered in previous chapters would be classed as personality tests. Obviously intelligence, special talents, and achievement in various fields are components of total personality. But it is convenient to consider separately the functions in which a person uses these abilities—his motives, emotional stability, adjustment techniques, and the like. To test such qualities we need different techniques from those we use in measuring abilities. One leading psychologist, Lee J. Cronbach, calls them tests of typical performance to distinguish them from ability tests, which are tests of maximum performance.

The attempt to measure personality characteristics has run into many special difficulties. First of all, the method that worked so well from Binet's time on—placing a testee in a standard situation and obtaining an actual sample of his behavior—generally is not applicable here. We simply cannot set up in the testing room the standard situations in which personality traits are likely to manifest themselves. Many of the most important of these traits are social in nature and show up only when an individual finds himself in a certain kind of group. In order really to sample such a trait as self-control, for instance, we would have to standardize situations that were irritating, frustrating, annoying. (Military psychologists tried to do this during World War II by making insulting remarks to persons who were attempting to concentrate on complex psychomotor tasks, but their efforts were not successful.)

[...] testing, the use of sample situation[...] room can be similar in essential respects to the problems a person encounters in special situations in life. But in testing personality it is difficult to reproduce the social situations in which personality chiefly manifests itself. There has been some challenging research in which this was attempted, but so far no generally usable tests have emerged from it.
The way makers of personality tests have most frequently taken to get around this difficulty is to substitute reported behavior for observed behavior. Instead of actually trying the testee out to see whether, for example, he becomes angry when someone makes disparaging remarks about him, we ask him to tell us how he reacts in such a situation. We include several questions, each dealing with a different specific manifestation of self-control. We obtain a score for him on the trait by counting how many of these manifestations he reports. If there are ten questions on holding one’s temper under provocation, the person who says “Yes” to nine of them scores higher in self-control than the person who says “Yes” to only five.
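
The counting rule just described amounts to very little code. The sketch below is purely illustrative: the item labels (sc01 to sc10) and the two answer sheets are hypothetical, and a real inventory would also key some items in the “No” direction.

```python
def trait_score(answers, keyed_items):
    """Count how many of the keyed items the person answered "Yes"."""
    return sum(1 for item in keyed_items if answers.get(item) == "Yes")


# Ten hypothetical self-control items, labelled sc01 .. sc10.
SELF_CONTROL_ITEMS = [f"sc{number:02d}" for number in range(1, 11)]

# Two hypothetical answer sheets: "Yes" to nine items and "Yes" to five items.
person_a = {item: "Yes" for item in SELF_CONTROL_ITEMS[:9]}
person_a["sc10"] = "No"
person_b = {item: ("Yes" if index < 5 else "No")
            for index, item in enumerate(SELF_CONTROL_ITEMS)}

print(trait_score(person_a, SELF_CONTROL_ITEMS))
print(trait_score(person_b, SELF_CONTROL_ITEMS))
```

The score is only as good as the reports it counts, which is the difficulty the following paragraphs take up.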

This method of testing personality by asking questions about behavior and feelings in various situations was invented at about the same time as group [...] recruits for emotional stability. This first collection of questions, called the Personal Data Sheet, was the progenitor of a long line of subsequent tests of general adjustment, neuroticism, and other traits related to mental illness and health.

This method—asking a testee to answer questions about his behavior and feelings in life situations—carries with it some testing problems unlike those encountered in ability measurement. The history of personality testing, indeed, [...] efforts to solve these special problems. The most obvious, though perhaps not the most serious, snag is that a person may not tell the truth about himself. In measuring abilities we do not have to worry about this. It is not likely that a boy with an IQ of 90 will be able to “fake” an IQ of 120. Unless his mind works at the 120 level, he is not able to answer the questions so as to attain that score. But it is quite possible for a fearful person to give “brave” answers to questions about his reactions to danger, or for a withdrawn person to say “Yes” to the question, “Do you have many friends?” If a person's suitability for a job, a college, or an exclusive club is to be judged on the basis of his personality test scores, he is under a real temptation to “fake good.”

Military psychologists also found out that at times a person may be tempted to “fake bad.” Some draftees who showed what looked like serious psychiatric disabilities on the initial screening test were adjudged quite healthy when more penetrating methods of examining them were employed. They had hoped to be classified neurotic and answered the questions toward that end.


Fortunately, psychologists have devised ways of discovering whether answers have been faked, so a score can be either discarded or corrected for deception. We shall mention some of these devices later in the chapter. Here we can simply note this difficulty as one of the special complications en[...]

[...] turned out to be even more difficult to surmount. A person [...] describe his own motives and emotional characteristics [...] wishes and intends to do so. As Freud made us aware, a large part of our motivation is unconscious. And the pull of un[...] a person's view of his own personality. The man with strong dependent needs often sees himself as an aggressive, self-reliant individualist; the woman with a powerful undercurrent of hostility in her relations with her family may see herself as gentle and self-sacrificing. Thus, [...] not mean what it seems to mean.

[...] enough, some of the characteristics people unconsciously express in [...] to test questions are common to many testees, and are therefore measurable. Various kinds of such response sets have been identified.


Allen L. Edwards studied one of these in detail, a set he called social desirability. He found that since persons who grow up in our society almost in[...] what is considered to be the “right” way of behaving and [...], the answers they give on any kind of personality inventory reflect their [...] to this awareness, perhaps more than they reflect actual emotional characteristics. A second response set that influences personality test scores is acquiescence, that is, the tendency to agree with what someone else says, to answer “Yes” rather than “No.” Still another response set that has been explored is deviance, or the tendency to give unusual or uncommon [...] personality questions. It is clear, then, that we cannot automatically take personality scores at their face value. We must have some way to [...] or to make allowances for the other tendencies, besides out-and-out faking, that influence these scores.
Another, rather different, difficulty psychologists have run into in devising personality tests has been the lack of unambiguous criteria for evaluating these tests. As we have seen, one standard method for validating ability tests is to correlate test scores with nontest indicators of the trait measured. The tradition began with intelligence tests. Binet took it for granted that the trait he was trying to measure was the same as the trait teachers observed in their bright, average, and dull students. Thus, he could find out whether his tests were measuring this characteristic by correlating test scores with teacher judgments, as expressed in school marks, promotions, and the like. In the case of most of the personality qualities we might wish to measure, however, it is difficult to obtain criterion measures from life. We can, of course, ask people to rate their students, employees, fraternity brothers, or neighbors for optimism, dependability, or general adjustment. But such ratings are unsatisfactory in many ways as measures of personality. First, raters often disagree markedly in their evaluation of the same person, reflecting their own attitudes perhaps as much as the characteristics of the person they are rating. Second, ratings show what is usually called the halo effect—that is, a person who elicits generally favorable attitudes tends to receive high ratings on everything; a person not generally liked is rated low on everything. Furthermore, it is difficult to persuade raters to spread their judgments over a range of numbers; if the rating scale runs from 1 to 9, they hesitate to give anyone a “1” or a “9” and may rate everybody “5,” “6,” or “7.” Had ratings been more satisfactory as a way of measuring personality, perhaps there would not have been as much pressure to develop personality tests themselves. Ratings are still widely used, though, both for evaluating personality in practical situations and for exploring the relationships of personality tests to other measures. One does not discard a dull tool until a sharper one is available. But in interpreting ratings we must always take their deficiencies into consideration. They are not really satisfactory criteria for personality tests.
One way of investigating the validity of personality tests is to compare defined groups of people singled out by society on the basis of some trait.
This method has been partic 


tendencies, for the extreme 
for psychiatric treatment, A 
adjustment and that psychi 
then evidence that they o 
becomes evidence for the 
There are [...] tests for negative than for positive personality traits. It is much easier to discover, for example, that Lawrence Wylie has a strong undercurrent of anxiety, a tendency toward paranoid thinking, and more than the usual number of psychosomatic symptoms than it is to assess his dependability, loyalty, and leadership potential. We must always keep in mind this bias of personality measurement toward the negative rather than the positive side whenever we use personality tests.

To some extent, this negative bias is being corrected as research continues. In one area, the measurement of vocational interest, E. K. Strong was able to capitalize on the existence of occupational groups in constructing the Strong Vocational Interest Blank, a valuable test for measuring motivations that determine life [...]. In another large research program, David C. McClelland and [...] tackled the achievement motive and devised ways of measuring it. We shall have more to say about these particular tests later. Here they are mentioned only as examples of a trend in research that will eventually round out the field of personality measurement so that we can assess favorable traits as adequately as unfavorable traits.


CLASSIFICATION OF PERSONALITY TESTS

By Traits Measured


What personality traits are measured by tests? It is not possible to answer this question completely, for literally hundreds of tests are available, and in many cases it is not clear just what they do measure. We can, however, outline some main headings under which most of the titles would fall. There are, first of all, many tests of what we label “general adjustment” or “emotional stability,” though it would probably be more accurate to call them measures of maladjustment, instability, or neuroticism, since they are made up of questions related to mental illness. Although their usefulness varies from situation to situation, evidence is abundant that tests of this sort often help in the diagnosis of what might be called vulnerability to psychiatric difficulties. The second main group of personality tests, overlapping to some extent with the first, consists of measures of social or unsocial tendencies. Here again it is the difficulties in social relationships that such tests show up most clearly. So far ways to measure unusual social sensitivity or real talent for social relationships have eluded us. A third group of tests, sharing common ground with both of the first two types, consists of measures of motives, or basic needs. A fourth and much smaller group consists of measures of values. A fifth variety, developed quite independently of the others, is vocational interest tests. A sixth variety is composed of attitude scales. These include measures of interpersonal attitudes as well as approaches toward things and issues.

In addition to these main classes, we would have to provide a large miscellaneous category to do justice to the enormous amount of exploratory research that has occurred. Varieties of temperament, character traits, psychosexual characteristics, ways of dealing with frustration, defense mechanisms, reactions to authority—for these and many other aspects of personality measurement techniques have been devised. But for most of these qualities we do not as yet have tests that are ready to be used in practical, as distinguished from research, situations. Browsing through psychological journals we can find a test for almost any trait we can think of (and probably for a good many others we haven’t). While only


Instead of classifying per- 
€ can classify them according 
ses here—the inventories and 


bjective performance tests, physiological measures such 
ograph or galvanic skin response, situational tests, and 
the inventories and the projective tests are the tools psy- 


PERSONALITY QUESTION) 
The MMPI

[...] in age and social class but not mentally ill, answered the same questions.
The authors of the test carefully tabulated the answers each group gave to each
of the questions. Only the items that had a clear, statistically significant
difference in frequency of response between psychiatric and normal groups
were scored as indicators of the psychiatric trend. For example, a response
is scored on the scale for Paranoia only if paranoid patients give it more
frequently than normal persons do.

The effect of proceeding in this empirical way is to by-pass some of the
persistent problems discussed in the previous pages. Instead of having to
struggle with the question, “Is the person really telling us the truth about
his attitudes and behavior?” we shift our position and say instead: “His answer
to this question is a bit of behavior in its own right. Let's see if we can
find out what it is related to.” We search for behavioral correlates of what
persons say about themselves, and give up the essentially unanswerable question
about whether, consciously or unconsciously, they are telling the “truth.”
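The empirical keying step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the item names and counts are invented, the two-proportion z test and the 1.96 cutoff are our choice of significance test, and nothing here reproduces the actual procedure or data of the MMPI authors.

```python
import math

def two_proportion_z(yes_a, n_a, yes_b, n_b):
    """z statistic for the difference between two independent proportions."""
    p_a, p_b = yes_a / n_a, yes_b / n_b
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (p_a - p_b) / se

def build_key(criterion_counts, normal_counts, n_criterion, n_normal, z_crit=1.96):
    """Keep only items whose "True" rate differs significantly between groups.

    Each counts dict maps item -> number of people in that group answering "True".
    The keyed answer is the one the criterion group gives more often.
    """
    key = {}
    for item, yes_c in criterion_counts.items():
        z = two_proportion_z(yes_c, n_criterion, normal_counts[item], n_normal)
        if abs(z) >= z_crit:
            key[item] = "True" if z > 0 else "False"
    return key

def score(answers, key):
    """Number of keyed items the person answered in the keyed direction."""
    return sum(1 for item, direction in key.items() if answers.get(item) == direction)

# Hypothetical tabulation: 100 patients and 100 normal persons, three items.
criterion = {"item_1": 60, "item_2": 48, "item_3": 20}
normal = {"item_1": 30, "item_2": 50, "item_3": 45}
key = build_key(criterion, normal, n_criterion=100, n_normal=100)
# item_2 drops out: 48 versus 50 per cent is not a reliable difference.
```

Note that the content of an item plays no part in the selection. An item enters the key only because the two groups answer it differently, which is the point of the empirical method.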

In the MMPI nine scales have been constructed in this manner to measure
tendencies toward different kinds of psychiatric difficulty. In the course
of various research projects, psychologists have added a number of other
scales based on other nonpsychiatric group comparisons, such as introverts
versus extroverts and prejudiced versus unprejudiced. The experience
clinicians in hundreds of agencies have had with the test has resulted in the
accumulation of an appreciable amount of practical knowledge about the
meaning of different combinations of high and low scores. These combinations,
or patterns, of scores have turned out to be of more value than any of the
single scores. A schizophrenic patient, for instance, may not score any higher
on the Sc (schizophrenia) scale than a neurotic patient does, but his pattern
of scores is distinctive enough so that a skilled interpreter can get useful
information about him from examining this profile. Figure 16 on page 78 shows
such a profile and the inferences on personality a clinician can draw from it.
While the authors were developing the regular scoring keys, they worked out
several special scales to help the interpreter correct for faking and
unconscious response sets. The method for detecting faking happily is
applicable to many other tests as well as the MMPI. Essentially rather simple
(a direct extension, in fact, of the empirical method by means of which the
personality scales themselves were constructed), it employs item analysis. The
general procedure is to ask a group of subjects to take the test twice, once
under standard instructions and once under instructions to bias their scores.
They might be told, for example: “Answer the questions in such a way as to make
as favorable a score as possible.” Item by item, responses under the two
conditions are tabulated. When the proportions are compared, differences on
some items are statistically significant. These items are used in the special
“fake good” key.

Figure 16. An MMPI profile. From such a record, a psychologist would suspect
that the subject was not entirely honest in his responses (high L and
moderately high K scores), was neurotic rather than psychotic (high Hs, D,
and Hy, with average Pa, Sc, and Ma), and was suffering from psychosomatic
symptoms (high Hs and Hy).

The procedure outlined above was


one of the sources of items in the K
scale of the MMPI. A high K score thus may indicate defensiveness or
unwillingness to admit one's personal weaknesses. A low K score may indicate
excessive self-criticism or a deliberate attempt to “fake bad.” It is always
necessary to analyze what high and low K scores may mean in individual
cases, however. A person with an unusually high K score may not be faking
good. He may really be good!

The MMPI has other special scales [...] Scale, fo[...] characteristics that
[...] to them. It measures a less subtle [...]
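The item analysis behind a “fake good” key, as outlined earlier in this section, might be sketched as follows. The sketch rests on assumptions of our own: because the same subjects take the test under both instructions, we compare the two conditions with McNemar's test for paired answers, which the text does not name, and the subjects, items, and 0.01 significance level are all hypothetical.

```python
import math

def mcnemar_p(changed_to_true, changed_to_false):
    """Two-sided p-value of McNemar's test (continuity corrected, 1 df)."""
    b, c = changed_to_true, changed_to_false
    if b + c == 0:
        return 1.0  # nobody changed his answer on this item
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Upper tail of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def fake_good_key(standard, faked, alpha=0.01):
    """Items whose answers shift reliably under "make a good impression" instructions.

    standard, faked: lists of dicts (one per subject, same order), item -> bool.
    Returns item -> the answer that faking pushes people toward.
    """
    key = {}
    for item in standard[0]:
        to_true = sum(1 for s, f in zip(standard, faked) if not s[item] and f[item])
        to_false = sum(1 for s, f in zip(standard, faked) if s[item] and not f[item])
        if mcnemar_p(to_true, to_false) < alpha:
            key[item] = to_true > to_false
    return key

# Hypothetical data: 40 subjects, two items.
# Item "a": 30 subjects switch from False to True when asked to look good.
# Item "b": nobody changes his answer.
standard = [{"a": i >= 30, "b": i % 2 == 0} for i in range(40)]
faked = [{"a": True, "b": i % 2 == 0} for i in range(40)]
key = fake_good_key(standard, faked)
```

A person who gives many of the keyed answers under ordinary instructions then earns a high score on the resulting scale, which, as noted above, still has to be interpreted case by case.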


The Strong Vocational Interest Blank (SVIB) 


The Strong blank is also
an empirically scored inventory. Its history is considerably longer than that of
the MMPI. Soon after World War I, E. K. Strong and some other psychologists
studying occupational differences happened upon an interesting fact:
that different occupational groups showed consistent differences in [...].
Some of these differences were what one would expect. It is only natural that
more engineers than salesmen should say that [...]. But differences also
turned up on items having no apparent connection with jobs, items concerned
with [...] amusements, hobbies, people, books, and many other [...]. Such
findings suggested that a profession may represent a way of life as well as a
way of earning a living. Strong saw that it would be possible to measure these
characteristics related to occupational choice. In a program of research
extending over many years, he worked with one occupational group after another,
comparing the responses they gave to the test questions with the responses
given by men in general. In constructing a scoring key for the Architect scale,
for example, he asked several hundred practicing architects to take the test.
Item by item, he tabulated their responses to find out which ones they answered
“Like,” which “Indifferent,” and which “Dislike.” Any answer for which the
difference between architects and men in general was statistically significant
was included in the Architect scoring key. Later, Strong devised a special form
of the test for women, and developed scales for women's occupations in the same
way.


Norms for all the occupational groups studied were provided, and the
relationship of interest scores to age, special abilities, and many other
human characteristics was explored. In its present form, the Strong Vocational
Interest Blank for Men consists of 400 items drawn from several areas of life:
occupations, school subjects, amusements, kinds of people, work situations,
and so on. The Strong Vocational Interest Blank for Women consists of some of
the same items, along with others that are more characteristic of what women
do in our society. There are 45 occupational keys for the men's blank, 26 for
the women's blank. Thus a young person who takes one of these tests receives,
in one package, a large amount of information about whether or not he would be
likely to feel that he fits in or belongs in one of these lines of work. In
Figure 17 on page 80 we see a Strong profile and what the person who obtained
it could conclude about himself.

With the SVIB, as with the MMPI, the empirical approach has yielded
supplementary keys for scoring other things besides the occupational
characteristics the test is specially designed to measure. From comparing item
tabulations of boys of 15 and men of 25 or over, an Interest Maturity key
took shape. Comparing men's responses with women's produced the
Masculinity-Femininity key. As with the occupational scales, we know what an
individual's score on one of these keys means if we are aware of how it was
constructed. For example, when a college freshman scores in the bottom
quarter of his group on Interest Maturity, we know that at present he is more
like a boy than a man. But we cannot predict from Interest Maturity score
alone how likely he is to grow up in his interest pattern as he proceeds
through college. Some men, after all, remain “boyish” all their lives.

[Figure 17: profile form headed “Report on Vocational Interest Test for Men,”
listing occupational scales in groups I through XI (among them Artist,
Psychologist, Architect, Physician, Dentist, Engineer, Farmer, Minister,
Musician, Banker, Sales Manager, and Lawyer) with raw and standard scores,
followed by the Occupational Level, Specialization Level, Interest Maturity,
and Masculinity-Femininity scales.]

Figure 17. Profile of scores made by an 18-year-old college student on the
Strong Vocational Interest Blank. While the A and B+ [...] possibilities, the
combination of Physician, Chemist, and [...] with Artist, Musician, and Author,
along with the A on Architect, makes architecture appear to be the most
promising choice, since it combines [...] interests. The young man entered an
architecture course and did well with it. (Reprinted from Manual for Strong
Vocational Interest Blanks for Men and Women by E. K. Strong, Jr., with the
permission of Stanford University Press, © Copyright 1959.)

On the basis of extensive research extending over more than three decades
we know a number of things about vocational interests, as measured by the
Strong blank. First, patterns of likes and dislikes are not primarily the result
of participation in an occupation, but exist before a person enters it. For
example, Stanford students who were later to enter medical school and become
physicians scored high on Strong’s Physician scale even in their undergraduate
days. Most people, it seems, develop their individual interest patterns
before their high school days are over.

Second, for most persons, such interest patterns are as permanent as any
aspect of personality that has been studied. Strong kept in touch with many of
the Stanford students he tested in the 1930’s and from time to time he wrote to
them and asked them to take the test again, so that he could find out whether
they had changed. Although he did usually find minor changes, and although a
few individuals showed completely altered interest pictures at different times,
the great majority did not change substantially over periods as long as 22
years.

Long-term research has also revealed a great deal about what Strong scores
do and do not predict. With a few exceptions such scores do not tell us just
how successful a person is likely to be in an occupation or in the training
program leading to it. (Perhaps we should never have expected that they would.
An individual score expresses how much a person resembles an occupational
group made up entirely of men successful enough to stay in the field. Degrees
of success are not considered.) What the scores do predict is how likely
individuals are to remain in particular occupations or shift to others. Further,
although some studies indicate that the correlation between interest scores and
self-rated job satisfaction is not very high, evidence suggests that those who
are in occupations on which they obtained high Strong scores are on the whole
better satisfied with their positions than are others whose interest scores did
not point in the particular direction they took.

We have discussed the MMPI and the Strong tests as outstanding examples of
what the inventory, or questionnaire, approach to personality measurement has
accomplished. They are tests in which at least some of the special measurement
problems considered at the beginning of the chapter have been solved by
selecting items that differentiate between groups already sorted out by
society, for instance, Schizophrenics versus Normals, or Dentists versus
Men-in-General. But there are other tests representing this same approach to
measurement problems. The California Psychological Inventory, for example, is
similar to the MMPI in many ways, but the questions of which it is composed
have nothing to do with psychiatric illness, and the special groups on whose
responses the scoring keys are based are normal people (high-achieving and
low-achieving students, for example). This test represents a forward step in
the direction of measuring positive characteristics of people rather than just
symptoms and deficiencies.

Besides the empirical approach there are other question-and-answer routes
to personality measurement. One way is to explore through correlational
methods the associations between items in a given test. If it is discovered, for
instance, that of 500 people who have taken the test most of the persons who
answer “Yes” to the first question also answer “Yes” to questions 5, 8, 19, 20,
and 23 and “No” to questions 3, 11, and 17, then these clusters of items
appear to have something in common. The test-maker asks himself, “What is
it?” Often he can tell simply from reading the items themselves. The next
step is to isolate this little cluster of items and make a separate scale of it,
perhaps adding other items that appear to be similar in order to make it more
reliable. An example of a test constructed in this way is the
Guilford-Zimmerman Temperament Survey with its separate scales for General
Activity, Restraint, Ascendance, Sociability, Emotional Stability, Objectivity,
Friendliness, Thoughtfulness, Personal Relations, and Masculinity. Another
example is the Kuder Preference Record [...]
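The search for items that “go together” can be illustrated with a small sketch. It is a deliberately simplified stand-in for the correlational and factor-analytic methods actually used in building such inventories: we merely correlate every item with one seed item and keep those above an arbitrary threshold. The question labels, answers, and the 0.5 threshold are all hypothetical.

```python
import math

def phi(x, y):
    """Pearson correlation between two equal-length lists of 0/1 answers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)

def cluster_around(seed, answers, threshold=0.5):
    """Items whose answers track the seed item.

    answers maps item -> list of 0/1 answers (1 = "Yes"), one entry per person.
    The sign of the correlation gives the keyed direction for each item.
    """
    cluster = {seed: "Yes"}
    for item, column in answers.items():
        if item == seed:
            continue
        r = phi(answers[seed], column)
        if abs(r) >= threshold:
            cluster[item] = "Yes" if r > 0 else "No"
    return cluster

# Hypothetical answers of eight people to four questions.
answers = {
    "q1": [1, 1, 1, 1, 0, 0, 0, 0],
    "q5": [1, 1, 1, 0, 0, 0, 0, 0],  # mostly agrees with q1
    "q3": [0, 0, 0, 0, 1, 1, 1, 1],  # exact opposite of q1
    "q9": [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to q1
}
cluster = cluster_around("q1", answers)
```

Deciding what the resulting cluster measures is still left to the test-maker, who must read the items and judge what they have in common.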


[...] Persons with various kinds of [...] stable occupational differences,
for example, [...]

[...] it in deciding what traits to measure and what questions to ask pertinent
to those traits. This was how most projective tests began (see pages 84-87).
But it has often been used as an approach to paper-and-pencil tests also.


One example is the Edwards Personal Preference Schedule (EPPS), designed
to measure the strength of 15 of the personality needs described by Henry A.
Murray in Explorations in Personality, needs like Achievement, Affiliation,
and Order. In using such a test in practical situations we must remember that
personality theories themselves have not been “validated.” We must be on guard
against making inferences about what a person scoring high in a certain trait
is likely to do in an actual situation unless we have some definite evidence
that the trait in question really operates in such situations. To hire a
secretary simply because she scores high on “Need for Order” as measured by
EPPS, for example, might be a mistake. We are not really sure that there is
such a trait as “Need for Order,” and even if it exists, we are not sure how it
would be expressed in our particular situation. It might make its possessor so
intolerant of necessary confusion that she would make life miserable for
herself, her fellow workers, and her boss.

The Edwards Personal Preference Schedule illustrates another aspect of
personality testing that has taken on great importance in current work. As
mentioned before, Edwards has collected a large amount of evidence which
indicates that the answers people give to personality questions are often
largely determined by a conscious or unconscious effort to present a socially
desirable picture of themselves. In order to rule out the effects of this
tendency, Edwards gave a preliminary try-out to all the questions he intended
to use in the test.

He asked his guinea pigs to evaluate the social desirability (SD) of each
of the answers to each question. (Previous research had shown that people
could make such evaluations, and that the SD ratings were stable from group
to group and from time to time.) Then as he built his test, he put together
pairs of items with similar SD ratings. The testee is required to choose one or
the other in each case, rather than just to say “Agree” or “Disagree” to each
of them. Thus, a high “Need for Achievement” score on the EPPS shows that
the person chose as descriptive of himself many statements about achievement,
even when he had to pass up statements about equally admirable qualities.
This method of presenting items, usually called forced-choice, has been used
by various psychologists, in rating scales as well as in self-descriptive
inventories.
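The forced-choice construction just described can be sketched as a matching problem. The sketch below is an assumption-laden illustration, not Edwards's actual procedure: statements are sorted by a hypothetical SD rating and each is paired with the nearest-rated statement that measures a different need, provided the two ratings differ by no more than an arbitrary gap of 0.3.

```python
def pair_by_desirability(items, max_gap=0.3):
    """Pair statements of similar social desirability from different needs.

    items: list of (statement_id, need, sd_rating) tuples.
    Returns a list of (statement_id, statement_id) pairs; statements with no
    suitable partner are left out.
    """
    remaining = sorted(items, key=lambda it: it[2])
    pairs = []
    while len(remaining) > 1:
        first = remaining.pop(0)
        for i, other in enumerate(remaining):
            if other[2] - first[2] > max_gap:
                break  # every later statement is even further away
            if other[1] != first[1]:
                pairs.append((first[0], remaining.pop(i)[0]))
                break
    return pairs

def need_scores(choices, need_of):
    """Count how often statements for each need were chosen."""
    scores = {}
    for chosen in choices:
        need = need_of[chosen]
        scores[need] = scores.get(need, 0) + 1
    return scores

# Hypothetical statements with invented SD ratings on a 1-to-5 scale.
items = [
    ("A1", "Achievement", 4.1),
    ("O1", "Order", 4.0),
    ("A2", "Achievement", 2.5),
    ("F1", "Affiliation", 2.6),
    ("O2", "Order", 1.0),  # no statement of similar desirability to pair with
]
pairs = pair_by_desirability(items)
need_of = {statement_id: need for statement_id, need, _ in items}
scores = need_scores(["A2", "A1"], need_of)  # a testee who picks achievement twice
```

Because both members of a pair are about equally admirable, a choice between them tells us more about the testee's preferences than about his wish to look good.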

Research aimed at overcoming other response sets, such as those discussed
earlier in the chapter, is flourishing. As a result, it is now possible
to make direct measurements of the tendencies to answer “Yes” rather than
“No” to questions, to give deviant rather than common answers, and to take
extreme rather than moderate positions when asked how strongly one agrees
or disagrees with various statements. Personality testing by questionnaire has
definitely come a long way since Woodworth proposed his Personal Data
Sheet in 1918.


PROJECTIVE TECHNIQUES

Let us turn now to the other main branch of personality testing, the
projective methods. With these, instead of asking the testee questions, the
examiner presents to him some ambiguous stimulus and asks him to interpret it
or react to it. Inkblots and pictures of human situations have been the most
common stimuli used. For example, see Figures 18 and 19. Since an inkblot is
not really a picture of anything, the interpretation a person gives must come
from within himself and thus express something about the way he perceives and
organizes his individual world. Similarly, since the picture in Figure 19 does
not make it clear what the two persons are doing or saying to one another,
any story a person makes up about it must stem from his own thoughts and
feelings. This suggests the reason for the label of this type of test: The
testee is said to project into the picture his own emotional attitudes and
ideas about life. Some have argued that self-expressive would be a better
over-all designation for these methods, but the title projective is old, and it
has stuck.

Figure 18. Inkblot Number 7 from the Rorschach series. (Hans Huber
Publishers.)

The first of these tests to be widely used, and still probably the most
popular of them all, is the Rorschach test. This is a set of ten inkblots of
various shapes. Some are black and white only, some have color as well. The
subject is asked to tell what he sees in each one. After he has said as much as
he wishes to say, the examiner asks as many questions as are needed to make
clear just what the subject saw in the blot and what aspect of it determined
his perceptions. Since 1921 when Hermann Rorschach, a Swiss psychiatrist,
proposed this technique, a number of somewhat different scoring systems have
been developed, and a tremendous number of books and journal articles have been
written about it. The various scoring systems provide for separate analysis of
the structure, or style, of the responses and of their content. Structure has
to do with such questions as: How productive is the person in responding to
this sort of stimulus? In other words, does he have many responses or few to
each card? Does he usually react to each figure as a whole or to its parts? To
what extent does he base his responses on form, color, and shading? Content
pertains to the extent to which he sees the blots as human figures, animals,
anatomical diagrams, maps, clouds, or other concrete things. The degree to
which a person tends to give either popular or original responses is also of
some interest.

Figure 19. A picture similar to those used in projective personality tests.
(Photo courtesy Standard Oil Company of New Jersey.)

It should be evident from even this brief description of the many aspects
to analyze in an individual's responses to inkblots that the job of
interpretation is a highly complex undertaking. For one thing, there is no way
in which a test like this can be validated all at once. As research studies
have accumulated over the years, it has become apparent that a good many of the
inferences clinicians were once accustomed to make on the basis of this test
were unjustified. It has not proved to be the “Open Sesame” to personality
secrets that its more enthusiastic practitioners thought it was going to be.
But in spite of its inadequacies it is still a valuable tool for practicing
clinicians. Used along with other techniques for assessing personality, such as
interviews and background information, it furnishes clues about matters that
can profitably be explored in studying an individual case. The Holtzman Ink
Blots, brought out in 1961, correct some of the deficiencies of the Rorschach
technique while keeping many of its advantages.

Another widely used projective technique is the Thematic Apperception
Test (TAT). It consists of cards showing pictures of people. What the situa- 
tion is in each case and how the persons are feeling about it are purposely left 
undefined so that the testee can have an opportunity to read in his own 
attitudes and ways of perceiving the world. His task is to make up a story for 
each picture, including in his account some explanation of what led up to the 
illustrated situation, what the characters are thinking and feeling, and what 


the outcome will be. Figure 19 shows a picture similar to those used in the 
TAT. 


As with the Rorschach, many ways of analyzing and interpreting a person's 
TAT stories have been explored. Generally, though, the examiner looks first to 
see which character has been made the protagonist of the story, since that is 
probably the person with whom the story-teller identifies. He then studies the 
details of each story carefully—what the subject said and how he said it— 
deriving hypotheses therefrom about the individual's needs, feelings, and 
attitudes, and noting how often the same trends occur in different stories.
Most clinicians use this information as a basis for a verbal description of the 
testee rather than a score or set of scores. Researchers, however, give over-all 
ratings for the personality traits in which they are interested. 

The research possibilities in this general method have been exploited in
many ways. Special sets of pictures for children, Negroes, and persons from 
other cultures have been tried. Methods of scoring for specific characteristics 
have been standardized. A good example of the use of the story-telling method 
in research is the work of McClelland and his associates on achievement moti- 
vation, to which we referred earlier (see page 75). They presented to their 
research subjects pictures carefully selected to stimulate the invention of 
stories about achievement. Then they analyzed each person's stories according 
to a standardized system. They considered the desires the main character 
expresses, the activity he engages in, and the obstacles he encounters, along 
with several other features. Thus, they obtained a score for each subject on 
Need Achievement. In a continuing program of research they have been study- 
ing the ways in which these scores relate to the people's backgrounds and 
subsequent achievements. 

We could mention many other particular varieties of projective tests, but
any complete discussion would carry us far beyond our purpose. One thing we
can say of all the projective tests: Their validation is incomplete. They are
not therefore ready to be applied in making decisions about people in practical
situations as teachers, social workers, and employment managers must do. In the
hands of a competent psychologist with a wide background of knowledge about
personality structure and functioning, they can often contribute to such
decisions. But there is no way in which a person can attain skill in the use of
projective tests without this background of psychological knowledge. The taking
of a single course in Projective Techniques certainly does not provide it.

Perhaps the greatest value of projective techniques lies in the contribution 
they make to research on personality. For they allow psychologists to get 
some clues about things people are not able to talk about directly—their 
underlying motives, assumptions, and ways of looking at the world. 


MISCELLANEOUS WAYS OF ASSESSING PERSONALITY 


In addition to these two principal kinds of personality measures, inventories
and projective techniques, there are some direct methods of evaluating
personality, methods that have been in use for centuries and are still used
today. These are observation, interviews, and ratings, all of which, it has been
repeatedly demonstrated, have serious weaknesses. The weaknesses arise mainly
because such methods involve observers, and the personality of the observer
plays at least as great a part as the personality of the observed in the final
product.
We see this in everyday life. Two bystanders witness the same automobile 
accident and tell different stories about what the driver of the car actually did. 
Two interviewers talk to the same applicant and come out with widely dif- 
ferent assessments of his competence and attractiveness. Two English teachers 
read the same theme; one marks it A, the other D. 

We could multiply instances of this kind indefinitely. But it would be 
unwise to conclude that we should never use such evaluations of personality. 
There are too many situations in which they are the only possible methods by 
which we can get some information about personal qualities. No tests are 
available, for example, that will tell a school superintendent whether a teacher 
he is thinking of hiring has the necessary characteristics for stimulating stu- 
dents in a classroom. Repeated efforts have been made to develop personality 
tests of the qualities good teachers show, but they have been only partially 
successful. Thus, the superintendent must rely on letters of recommendation 
and interviews in making his evaluations. 

There are many other situations where, unless or until better personality 
tests become available, ratings, observations, interviews, and letters of recom- 
mendation will continue to be employed. Earlier in this chapter we discussed 
some of the inadequacies of ratings as criteria for personality tests. Clearly, 
if it is necessary to use rating scales to assess personality traits not otherwise 
measurable, we must give some attention to remedying these defects. Dis- 
crepancies between raters will be less troublesome if we make it a practice to 
secure several raters and use the average of their judgments. Clear instructions
about just what is to be rated also help. Judges can be forced to distribute
their ratings over a wide range rather than to mark everybody “average”
or “above average” by asking them to rank the individuals to be evaluated
or to divide them into, say, five groups each made up of a specified number of 
people. Or the forced-choice method we have explained when discussing the 
Edwards Personal Preference Schedule can be used for ratings as well as for 
self-reports. In the teacher-rating problem mentioned above, for example, one 


item to be rated by the supervisor who has been in charge of a student's prac- 
tice teaching might be: 


In which of these ways is the applicant more competent? 
Writes clear detailed lesson plans. 
Wins confidence of his class.


Both of these qualities are socially desirable but one is probably more fun- 
damental to success as a teacher than the other is. To construct such a forced- 
choice scale, of course, we must first collect a great deal of information both 
about the actual social desirability of certain kinds of behavior and about 
traits that are really related to success at a particular task. Developing any 
kind of good rating scale is not easy. 
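Two of the remedies mentioned above, averaging several raters and forcing the ratings into groups of fixed size, can be sketched briefly. The raters, names, ratings, and group sizes below are all hypothetical, and breaking ties alphabetically is simply a convenience of this sketch.

```python
def mean_ratings(ratings_by_rater):
    """Average each person's ratings over all raters who rated him.

    ratings_by_rater: {rater: {person: rating}}. Returns {person: mean rating}.
    """
    totals, counts = {}, {}
    for ratings in ratings_by_rater.values():
        for person, rating in ratings.items():
            totals[person] = totals.get(person, 0) + rating
            counts[person] = counts.get(person, 0) + 1
    return {person: totals[person] / counts[person] for person in totals}

def forced_distribution(scores, group_sizes):
    """Rank people from highest to lowest score and cut the ranking into
    groups of the specified sizes (ties broken alphabetically)."""
    if sum(group_sizes) != len(scores):
        raise ValueError("group sizes must add up to the number of people rated")
    ranked = sorted(scores, key=lambda person: (-scores[person], person))
    groups, start = [], 0
    for size in group_sizes:
        groups.append(ranked[start:start + size])
        start += size
    return groups

# Two hypothetical raters judging five people on a 1-to-5 scale.
ratings = {
    "rater_1": {"Ann": 5, "Bob": 3, "Cal": 4, "Dee": 2, "Eve": 3},
    "rater_2": {"Ann": 4, "Bob": 3, "Cal": 2, "Dee": 1, "Eve": 5},
}
means = mean_ratings(ratings)
groups = forced_distribution(means, [1, 3, 1])  # top, middle, bottom
```

The two raters disagree sharply about Cal and Eve. Averaging does not settle who is right, but it keeps either rater's idiosyncrasy from deciding the matter alone.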

Although psychologists have devoted more attention to rating scales than 
to the other miscellaneous ways of obtaining information about personality, 
they have done some research on observations, interviews, application blanks,
and recommendations. The purpose, of course, has been to increase the ac- 
curacy of these techniques and to get rid of some of the sources of ambiguity 
and error. One way of doing this is to clarify the instructions given to ob- 
servers, interviewers, and persons filling out application blanks. Another is to 
analyze which observations, which questions, and which answers show predic- 
tive validity for some particular criterion. Personnel psychologists have found 
ways to score the answers applicants for a certain position give (prospective
insurance salesmen, for example) by using tabulations of the answers given
by successful and unsuccessful workers previously hired.


Behavior observations of many sorts still constitute an indispensable source
of information about personal qualities. It seems unlikely that testing
techniques will ever replace them completely.


Applications of Tests and Measurements

TESTS AND INDIVIDUAL DECISIONS
Every day thousands of decisions are made that rest at 
least partly on evaluations of a person’s psychological 
characteristics. Some are decisions about people, as when 
an employment manager selects a new man for a job, or 
the staff of a mental health clinic plans the treatment of 
a new patient. Some are decisions by people, as when 
Jerry makes up his mind that he will enter graduate school 
instead of taking a job as a petroleum geologist, or 
Beverly shifts her major from biology to political science. 
In any of these situations, a well-chosen test in the hands 
of someone who understands it may be of enormous value.

Decisions about People

When the decision involves the selection of one or more individuals from among
a group of applicants, the contribution a test makes can be analyzed in a
simple, straightforward way. Whether the selection is for a college, for a job,
or for a therapy group, the purpose of the decision-maker is to pick out people
who will fit into the place he has in mind for them. He does not expect to be
right every time, only most of the time; he wants to select more successes than
failures. Tests can sometimes improve his batting average even if their
reliability and predictive validity are not especially high. Consider, for
example, the expectancy table shown in Table 6. If the person who does the
hiring selects


TABLE 6

How a Stenographic Proficiency Test Can Contribute
to the Selection of Good Stenographers

Rows: Stenographic Proficiency Test score groups.
Columns, first block: number in each score group receiving each rating on
stenographic ability (Below Average, Average, Above Average, Excellent).
Columns, second block: percentage in each rating group that fall in each
score group (Below Average, Average, Above Average, Excellent).
[Table entries not recoverable.]

This expectancy table shows the number and percentage of stenographers of
various rated abilities who came from specified score groups on the S-B
Stenographic Proficiency Test. (N = 52, mean score = 15.4, S.D. = 2.9, r = .61;
score is average per letter for five letters.) It indicates that if only girls
who scored at least 14 had been hired, none of them would have been rated
“below average” and many would have been rated “above average” or “superior.”
From Psychological Corporation Test Service Bulletin, No. 38, December 1949.


only those applicants who make scores of 14 or higher on the test, he can
expect that all of them will succeed and none of them really fail. In a tight
labor market, he may have to hire people who get scores lower than 12 in
order to get enough workers to keep things going. In this case he anticipates
that there will be a larger number of failures, but still not as many
as there would have been if he had not used the test. For the employment
manager a test is mainly a tool to use in increasing the number of successes
and lessening the number of failures. He is not concerned about the meaning
of individual scores and patterns of scores. He realizes that some errors in 
selection occur as the test is used—that some persons with high scores do not 
make good—but he does not worry about these failures as long as there are 
not too many of them. Another possible error does not bother him at all—that 
because of low test scores he may have turned down a few persons who might 
have been outstanding successes. 
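The reasoning behind an expectancy table can be made concrete with a short sketch. The ten score-and-rating records below are invented for illustration and do not reproduce the entries of Table 6. The function names and the choice of “below average” as the definition of failure are likewise our own assumptions.

```python
from collections import Counter

def expectancy_table(records, bands):
    """Count ratings within each score band.

    records: list of (test_score, later_rating) pairs for people already hired.
    bands: list of (low, high) inclusive score ranges.
    """
    table = {band: Counter() for band in bands}
    for score, rating in records:
        for low, high in bands:
            if low <= score <= high:
                table[(low, high)][rating] += 1
                break
    return table

def outcome_at_cutoff(records, cutoff, failure="below average"):
    """Return (number hired, number of failures) if only scores >= cutoff are hired."""
    hired = [rating for score, rating in records if score >= cutoff]
    failures = sum(1 for rating in hired if rating == failure)
    return len(hired), failures

# Hypothetical records: test score at hiring, supervisor's rating later on.
records = [
    (18, "excellent"), (17, "above average"), (16, "above average"),
    (15, "average"), (14, "average"), (13, "average"),
    (13, "below average"), (12, "below average"),
    (11, "average"), (10, "below average"),
]
table = expectancy_table(records, [(10, 13), (14, 18)])
strict = outcome_at_cutoff(records, cutoff=14)   # 5 hired, no failures
lenient = outcome_at_cutoff(records, cutoff=12)  # 8 hired, 2 failures
no_test = outcome_at_cutoff(records, cutoff=0)   # all 10 hired, 3 failures
```

In these invented figures the lenient cutoff admits some failures but proportionally fewer than hiring everyone would. The records also contain one person scoring 11 who turned out “average,” the kind of rejected applicant whose loss, as noted above, does not trouble the employment manager.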

There are many other situations besides those pertaining to work and
education where tests contribute to decisions about people. In clinics the
basic question is: “Just what is wrong here?” When a mentally ill patient
enters a hospital, the first step in treating him is to try to find out what he is
suffering from. If his general symptoms suggest either schizophrenia or an
organic psychosis arising from a brain disease, his psychologist uses tests
which, previous research has shown, differentiate between schizophrenic and
organic patients. Since such diagnostic decisions are typically made by teams
of experts rather than by a psychologist alone, the clues obtained from testing
are supplemented by clues coming from other types of examination. In this
case, an analysis of the patient’s electroencephalogram (a record of electrical
activity in the brain) and a physician’s report about the patient’s illnesses
that might have produced brain damage would almost certainly be brought in.
Tests are also used in this diagnostic way by reading specialists who try to
pinpoint what a slow reader’s trouble is so that they can teach him what he
has not learned in the regular schoolroom, by speech therapists who suspect
that personality characteristics are relevant in the case of a stutterer, and by
many other clinical and educational workers.

Such clinical uses differ from the selection situations discussed above in that
test results need only furnish clues or hypotheses, which are always checked
against other kinds of information about the person being studied. Thus, it is
even more true here than in selection situations that tests may be of some
value even though they are not highly reliable or completely valid for the
purpose for which the examiner is using them. His ingenuity in thinking of
ways to track down new clues about his patient’s difficulty is what counts. He
may even improvise some testing procedures that have never been studied
systematically at all. If such procedures furnish hypotheses he can check
against facts drawn from the patient’s behavior, symptoms, and history, they
have served their purpose. But that some novel unstandardized procedure is
a workable test in a particular case (or even in 10 or 100 cases) does not
constitute evidence that it is a useful tool for decision-making in all
comparable situations. In diagnostic testing many instruments are in fact
resorted to that do not meet the standards we set up in Chapter 3. This charge
applies to a large proportion of the projective tests in clinical use. Therefore,
we must be careful not to conclude that they would be valuable when decisions
are made entirely or mainly on the basis of test results alone. It is imperative
that hypotheses based on tests that are less than adequate be checked against
other sources of information.


Decisions by People 


When decisions are to be
made by an individual, as when a person seeks counseling, the utilization of
test information is more complicated. Even when a test has a fairly high
validity coefficient, each test score has quite a range of criterion scores
associated with it. This is illustrated in Table 6 on page 90: persons with scores
in the 10-11 class ranged from Below Average to Above Average on the
criterion variable.
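
The spread of criterion scores around any one test score can be put in rough
numerical terms. The sketch below uses the standard error of estimate from
ordinary linear prediction, which is an assumption brought in here for
illustration and is not a computation taken from Table 6. The function name and
the sample validity coefficients are hypothetical.

```python
import math


def standard_error_of_estimate(criterion_sd, validity):
    """Spread of criterion scores around the value predicted from a test score.

    Assumes simple linear prediction of the criterion from the test.
    """
    if not -1.0 <= validity <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return criterion_sd * math.sqrt(1.0 - validity ** 2)


if __name__ == "__main__":
    # Hypothetical validity coefficients, with the criterion SD set to 1.
    for r in (0.3, 0.6, 0.9):
        spread = standard_error_of_estimate(1.0, r)
        print(f"validity {r}: spread is {spread:.2f} of the criterion S.D.")
```

Under this assumption, even a validity coefficient of .6 leaves a spread of
about eight-tenths of the criterion standard deviation around the predicted
value, which is why persons with the same score can end up at quite different
levels on the criterion.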

In struggling with the problem of whether to take the job as an oil company
geologist or return to graduate school to work for a Ph.D., for example, Jerry
may wish to compare his general intellectual ability with that of other graduate
students by means of a test like the Graduate Record Examination. But
how shall he use the information given him that his score is 450? If norm
tables tell him that this is just a little below the average level for graduate
students in geology, he cannot be certain of below-average grades. Tabulations
of what students in the past have done show that some students with scores
like his come out with straight A’s, others with C minus averages. Which kind
is he? Error in either direction in his decision will be costly to him. If he 
decides to go ahead with graduate work and finds after a year or two that he 
is not going to be able to qualify for a Ph.D., he will feel that he has wasted a 
considerable fraction of his life. Further, he may not at that time have as 
good a position offered him as the oil company is offering him now. If, on 
the other hand, he decides against graduate work on the basis of his below- 


Some Guiding Principles


Clearly, the standards we
apply in selecting tests should be set according to the planned uses of test
results. We can never say, “This is the best intelligen 
it for everything," or “This is a good test of emotio 
We always ask, “Who is to use the results of this tes 
before we can say anything about its probable contr -making 

For all the classes of decision we have been considering, we can lay down
some general guidelines about using tests. In selection situations, it is
advisable to try out a proposed test without using the scores for selection
purposes. After we find out how testees in this trial group turn out, we can set up
an expectancy table like that in Table 6. This will show how many failures
there were altogether and how many there would have been had a particular
passing score been set in order to disqualify some applicants at the outset.
We can then analyze the probable effects of different minimum scores before
deciding what procedure to use in the future.
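
The tryout procedure just described can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the trial-group records,
the score bands, and the rule that a rating of "below average" counts as a
failure are all hypothetical, and none of the figures are taken from Table 6.

```python
from collections import Counter

# Hypothetical tryout data: (test score, later job rating) for each person
# hired without regard to score. None of these figures come from Table 6.
TRIAL_GROUP = [
    (9, "below average"), (10, "below average"), (11, "average"),
    (12, "below average"), (12, "average"), (13, "average"),
    (14, "average"), (15, "above average"), (16, "average"),
    (17, "above average"), (18, "superior"), (19, "superior"),
]

FAILING = {"below average"}  # assumed definition of a "failure"


def expectancy_table(records, bands):
    """Count ratings within each score band; bands are (low, high) inclusive."""
    table = {band: Counter() for band in bands}
    for score, rating in records:
        for low, high in bands:
            if low <= score <= high:
                table[(low, high)][rating] += 1
                break
    return table


def outcome_at_cutoff(records, cutoff):
    """Return (number hired, number of failures) if only scores >= cutoff pass."""
    hired = [rating for score, rating in records if score >= cutoff]
    failures = sum(1 for rating in hired if rating in FAILING)
    return len(hired), failures


if __name__ == "__main__":
    bands = [(8, 11), (12, 13), (14, 15), (16, 19)]
    for band, counts in expectancy_table(TRIAL_GROUP, bands).items():
        print(band, dict(counts))
    for cutoff in (0, 12, 14):
        hired, failures = outcome_at_cutoff(TRIAL_GROUP, cutoff)
        print(f"cutoff {cutoff}: hired {hired}, failures {failures}")
```

With these made-up records, a passing score of 14 removes all three failures
but also halves the number of people hired, which is the kind of trade-off the
expectancy table is meant to make visible before a passing score is fixed.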

In making preliminary tryouts, it is advisable whenever possible to compare
several tests. Often some combination of two or three turns out to be
more effective than any one alone. Financial and practical considerations
always enter into decisions, of course. It may be that the use of Test A will
reduce the failure rate from 29 per cent to 10 per cent. But if the examination
must be given to each applicant individually, it may make better sense to use
instead a short group quiz, Test B, even if it only reduces the failure rate to
14 per cent. In dollars and cents the cost of the extra failures may be lower
than the cost of the individual testing.
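
The dollars-and-cents comparison can be worked through as a small calculation.
In the sketch below the three failure rates are the ones quoted above; the
testing cost per person hired and the cost of one failure are hypothetical
figures chosen only to show the arithmetic, and the function names are likewise
invented for the example.

```python
# Failure rates (per cent) are the ones quoted in the text; every dollar
# figure below is a hypothetical value chosen only for illustration.
FAILURE_PCT = {"no test": 29, "Test A": 10, "Test B": 14}
TEST_COST_PER_HIRE = {"no test": 0.0, "Test A": 40.0, "Test B": 5.0}
COST_PER_FAILURE = 500.0


def total_cost(option, hires=100):
    """Testing cost plus expected cost of failures for a batch of hires."""
    expected_failures = hires * FAILURE_PCT[option] / 100
    return hires * TEST_COST_PER_HIRE[option] + expected_failures * COST_PER_FAILURE


def cheapest_option():
    """Name of the option with the lowest total cost under these assumptions."""
    return min(FAILURE_PCT, key=total_cost)


if __name__ == "__main__":
    for name in FAILURE_PCT:
        print(f"{name}: {total_cost(name):.0f} dollars per 100 hires")
    print("cheapest:", cheapest_option())
```

With these assumed costs the group quiz, Test B, comes out cheapest. The
ordering depends entirely on the figures supplied: under the same testing
costs, Test A becomes the better buy once a single failure costs more than 875
dollars.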


We must often weigh the expense of a testing program to eliminate
unpromising candidates against the cost of training. If a job merely calls for a
skill anyone can learn in a few days, the use of tests to select particularly apt
persons is usually not worthwhile. On the other hand, if long training periods
and costly equipment are needed to make a person skillful, even a highly
expensive testing program to pick out the most promising candidates for
training can be justified, as in the Air Force selection program discussed in
Chapter 5.

In summary, a good rule when using tests to select people is: Get information
about their specific validity, that is, what they accomplish in a particular
situation. A good rule in diagnostic situations is: Seek out as many different kinds
of information as possible on the given problem. Use tests as one source of
ideas about the case, but do not rest the final decision on test results alone.
Corroborating evidence must support them.

Where a person wishes to use tests to help him make decisions about his own
life, the most important principle is to choose tests very carefully. It is
here that the standards set up in Chapter 3 are most essential. For a test to
contribute substantially to vital decisions, it must be reliable enough so that
an individual's score gives a fairly accurate indication of the person's relative
standing in terms of the characteristic being measured. It must have a long
enough history to make plain just what trait it is measuring. And we must
know what predictions can reasonably be made from it of progress along
specified lines for persons with low, medium, and high scores.

What we do not always recognize about tests as tools is that a test is not
built all at once like a house or a machine. It develops over a period of years,
and its usefulness grows as experience with it increases. We could range all the 
varieties of test we have considered along a scale that would show how well 
developed they are at present. A brand-new test, no matter how ingenious an 
idea it represents or how great the social need for it, would be at the bottom 
of such a “development” scale. A test like the Stanford-Binet, which has been
in continuous use since 1916, would stand at the top, for a well-trained
examiner familiar with the large body of knowledge that has grown up around
it knows what it measures and what kinds of decisions it facilitates.

As we have seen, situations differ in the degree of development required in
a test. For diagnostic work, well-developed tests are to be preferred whenever
they are available, but even tests low on the scale can serve as sources of
hypotheses if such hypotheses are carefully checked against nontest data. In
selecting people for jobs, we also prefer to use well-developed tests, but we
can bring in newer ones if we are in a position to run tryout studies. In
counseling, well-developed tests are most essential, although even here a new test
can sometimes prove a fruitful source of ideas for the person himself to check
out against nontest evidence. In such a case, however, it is especially
important for the counselor to know enough to make clear which kinds of test
results cannot stand alone. Tests are tools, and their value depends chiefly on
the skill of their users.


From the earliest intelligence tests on, the principal purpose of such
instruments has been to compare individuals in situations like those we have been
discussing. But almost immediately another possible use became apparent:
to facilitate research on age-old problems of group differences. For centuries
before the era of psychological testing, people had been arguing about whether
or not women were intellectually inferior to men, the dark races to the white
races, and the peasant and artisan to the aristocrat. Men had built up complex
social structures on assumptions of such group differences. Skeptical thinkers
had again and again questioned these assumptions. Yet neither side had had
any way to prove or disprove them. Perhaps, it was now hoped, the new
psychological tests would be exactly the tools that had been needed all along.
Perhaps at last these vexing questions could be settled.

Investigators tackling group differences have turned to intelligence tests
more than any other variety. For one reason, general comparisons of
intellectual capacity seem to fit most closely the old controversies; one question, for
example, that during the nineteenth century seemed most urgent to many
educators was whether female intellects were really equal to the strain of college
education. For another reason, intelligence tests were available earlier than
any of the others. Special-ability tests came later, personality tests later still.

The evaluation of the mass of evidence about group differences is a complex 
task. One source of ambiguity is the nature of tests as measuring instruments. 
Early workers did not see any problem here. It was only after tests reached a 
fairly high stage of development that researchers began to be concerned about 
it. One difficulty is the matter we took up in Chapter 1: that the numbers in
which test scores are expressed do not constitute measurement scales in the
sense that measurements of quantities like height and weight do. But before
it was clear, first, that there are different levels of measurement, and second,
that even though tests do not usually give us ratio scales, we can work
legitimately with interval or ordinal scales, it looked to many honest researchers
as though mental tests were simply unusable as instruments of science.

An even more serious difficulty is whether the content of tests makes them
suitable for group comparisons. Now, all tests are made up of questions or
tasks to be presented to subjects. As explained earlier, it became apparent
soon after mental testing was invented that an individual's success with such
tasks depended on a complex combination of inherited potential and
experience. If our only concern is to find out where one individual stands, we can
overcome this problem fairly well by choosing a test whose content is
appropriate for his background and then comparing him with appropriate norm
groups. But when we attempt to compare two groups, the question about
whether the content and norms of the test are equally appropriate for both
groups becomes serious. There is no altogether satisfactory way of answering
it. As we saw in Chapter 3, any test score (MA, IQ, percentile rank, stanine,
and so on) tells us where a person stands in relation to a particular reference
group with regard to a particular tested characteristic. Even if we make our
norm groups representative of the total population, we do not automatically
render the questions fair to minorities within the population.

For a while, many psychologists assumed that verbal tests were most
subject to criticism on these counts, and that if nonverbal, or nonlanguage,
tests could be devised, group comparisons would be fair, regardless of subjects’
cultural backgrounds. Experience has clearly indicated that this assumption is
unsound: children must learn, for example, to see details in pictures
and to manipulate geometrical forms rapidly and accurately, just as they must
learn to talk, to read, and to figure. Cultural groups that give their children
little or no experience in looking at pictures or in putting odd-shaped blocks
together cannot be considered inferior in innate ability if their young
average lower scores than the usual norm groups, which have been made up of
children who obtain such experience as a matter of course. It seems unlikely
now that we will ever find a type of test question or task that can be considered
really “culture-free,” or as some prefer to call it, “culture-fair.”

Must we then give up the aim of exploring group differences by means
of mental tests? While there might be some variation among social scientists
in their answers, the majority would probably say “No.” What is called for
is a program of research in each area of doubt. We ought first to find out
what differences exist and then to carry out a long series of special studies
designed to tell us what these differences mean. On the critical topics of sex,
race, and social-class differences, such programs of research have occurred
mainly by chance, and each new study has added something to what was
previously known. In the paragraphs to follow, we shall try to make clear
where we stand now on each of these three issues, recognizing that the
research on which our knowledge rests has been somewhat haphazard and that
the last word has not yet been said.


Sex Differences 


Let us look first at what
we know about differences between the sexes. The first conclusion
we can state is that carefully constructed intelligence tests do not show
significant discrepancies in total score between males and females. There
is, however, some divergence in the pattern of scores on special abilities.
If a test is made up of several separate subtests, the profiles for the sexes
characteristically differ even though the over-all score does not, as Figure 20
graphically shows. The nature of this difference in pattern suggests that it
may arise from educational influences and from the characteristic
expectations families have for boys and for girls. For the differences are in line with
stereotypes about the sexes: that females are more verbal, for example, and
males more mathematical.

Figure 20. Average scores made by 6900 9th grade boys and 7400 9th grade
girls on the Differential Aptitude Tests. (Bennett, G. K., Seashore, H. G., and
Wesman, A. G. Manual for Differential Aptitude Tests, 3rd ed., p. 26.
Psychological Corporation.)

Personality tests, too, have been used in many studies comparing men with
women or boys with girls. The general conclusion we can draw from these
studies is that the sexes differ more in personality traits than they do in
abilities. For example, Figure 21 shows the kinds of results that have been
obtained with an inventory measuring different kinds of values. Women
get higher scores than men do on aesthetic, social, and religious values, but
men score higher on theoretical, economic, and political values. And there
are other instances: Work with vocational interest tests also leads to
conclusions of sex differences. Evidence from personality questionnaires supports
observations of behavior that males are considerably more aggressive than
females. Females usually score higher than males on measures of neurotic
tendency. And so on with many other kinds of personality traits.

Figure 21. Sex differences on the Allport-Vernon-Lindzey Study of Values:
average male and average female profiles on the Theoretical, Economic,
Aesthetic, Social, Political, and Religious scales. (Allport, G. W., Vernon, P. E.,
and Lindzey, G. Study of values, 3rd ed. Boston: Houghton Mifflin, 1960.)

Few research workers would at present question the existence of
personality differences between the sexes. What they are more interested in now
is their sources. At first questions about the origin of sex differences were
usually formulated in some simple “either-or” manner, for example, “Are
differences biological or social?” But we have come to realize that such
statements are far too simple. Sex differences are both biological and social. Boy
and girl babies begin life with differences in body structure and body
chemistry, and their patterns of growth are dissimilar. Partly because of these
facts, partly because of the complex training of their own sex roles,
parents treat the two somewhat differently and expect different
things of them. By the time children have reached nursery-school
age, they have developed ideas of their own about what it means to
be a boy or a girl. So we need to put all these factors together to
explain what sex differences in average scores mean. A great deal
of research is going on in this area, motivated in part by the thought
that if we can understand the complex factors in this differential
development, we may also get clues to understanding other complex
aspects of personality. Questions like “How much and what kind of
influence do fathers have as compared to mothers?” come to mind.

Figure 22. A comparison of the distributions of scores on the DAT Test of
Space Relations for 9th grade boys and girls. (Bennett, G. K., Seashore, H. G.,
and Wesman, A. G. Manual for Differential Aptitude Tests, 3rd ed., p. 26.
Psychological Corporation.)

In thinking about sex differences, we must always keep in
mind that most figures give us averages, and that within each sex
there is a large amount of variation. Graphs of whole distributions of test
scores nearly always show a great deal of overlapping. If we look carefully at
Figure 22, we find that the highest girl in the group scores far


ary Eustis, should be admitted to
e is “What score did Mary make?”


In the United States com- 
ade again and again. To 
answer is clearly “Yes.”
States and Canada, when 


standardized intelligence or achievement tests are given groups of Negroes 
and white school children or adults (as in the armed forces), the average 
for the white group is significantly higher than the average for the Negro 
group. (Individuals, of course, vary widely.) 

If we are to go beyond this fact of differences in averages, however, we 
must look at other research investigations designed to throw light on what 
the differences mean. When the reports of Negro-white comparisons appeared 
in the 1920's, most psychologists were still assuming that intelligence tests 
measured pure innate ability, unaffected by experience or training. If this 
were indeed the case, then one could conclude without too much doubt what 
Americans and Europeans had been concluding for centuries without benefit 
of test evidence—that black men were intellectually inferior to white men. 
But as it became disturbingly apparent that intelligence test scores reflect
environmental influences as well as innate potential, the easy conclusion about
the biological origin of race differences became less tenable. Research workers
began to search for “natural experiments” (special situations of some sort)
that would enable them to rule out some hypotheses about the meaning of
the findings and to support others.

One of these hypotheses was that Negroes do less well than whites be- 
cause the most common tests are unfair to them. If this were the main reason 
for the differences, we might expect to find certain tests in which Negroes 
equal or exceed the white averages. This particular hypothesis has not stood 
up very well under investigation. Although particular studies often show 
that Negroes are more handicapped on some kinds of test item than on 
others, on no type of question or task that is useful in measuring intelligence 
do Negroes, as a group, excel. The difference in average scores does not 
seem to be a mere matter of verbal or spatial ability, or speed of response. 

A more productive line of investigation opened up when it was noted that 
the averages on the Army Alpha test used in World War I were actually 
higher for Negroes in some northern states than they were for whites in some 
southern states. For example, the average for white recruits from Arkansas, 
Kentucky, and Mississippi was about 41. Negroes from New York averaged 
44, Negroes from Ohio 49, Negroes from Illinois 47. Two possible explanations
for these findings occurred to those who studied them: one, that superior
Negroes tended to migrate to Northern cities; the other, that educational
advantages in the North led to better performance on intelligence tests.
So special studies were set up to investigate these hypotheses. These studies
have fostered the conclusion that it is the educational factor that is really
crucial. Research by Otto Klineberg in New York and by E. S. Lee in
Philadelphia showed clearly that among Negroes in Northern schools those who
had attended for long stretches scored higher than those who had been going
for shorter periods.

There is some evidence, however, that differences in the amount and quality
of formal schooling do not account for all the obtained Negro-white differ- 
ences. H. A. Tanser in a study of a Canadian town where Negro and white 
children had attended the same schools since Civil War days reported higher 
averages for white children in spite of the apparent equality in educational 
opportunity. There are probably other and more complex factors than legal 
equality that determine how well children can take advantage of the
educational opportunities they have. For one thing, a good deal of education
occurs before children enter public school. In very poor homes where there 
are no books, no pictures, and few toys, and where little special attention is 
given to stimulating children's mental growth, the “aptitude for education” 
we measure by means of intelligence tests probably does not develop satis- 
factorily. Because the socio-economic level of Negro families is lower than 
that of white families in most communities, such a handicap for Negro
children might well occur. Lee’s study in Philadelphia showed that the highest
scoring group of Negro children was composed of Philadelphia-born children
who had attended kindergarten. And these children consistently scored about
three IQ points higher than the Philadelphia-born group who had not gone
to kindergarten. Perhaps the kindergarten experience provides preschool
education often lacking in poor children's homes.

Here again it is important to remember that there is a great deal of varia- 
tion in intelligence within the Negro groups, just as there is within the white 
group. Averages never tell the whole story. It is not at all uncommon for in- 
dividual Negro students to make extremely high scores on intelligence tests 
and to stand at the top of their classes. In all the group comparisons we have 
mentioned, in the distributions of scores a substantial minority of the Negro 
scores fall above the mean for whites, and a considerable number of the 


white scores fall below the mean for Negroes. To evaluate an individual we
must test him individually.

hear we must be somewhat familiar with the whole body of research in this
area rather than just isolated pieces of it.

Decisions about social policy are ultimately grounded on many
considerations, not the least of which are ideals of justice and equality. Research
results, whatever they are, do not determine such decisions. Their main value
for policy-makers, however, lies in helping them predict what the probable
consequences of various alternative programs will be. Studies of Negro-white
differences on measurable traits can serve this purpose.


Social-Class Differences 


Another research topic
that has received special attention since the early days of mental testing is
the problem of the relationship of social class to intelligence. Here, too, it is easy
to find group differences, but not so easy to determine what they mean. Doing
so has required a long sequence of related studies.

One thread runs through them all: When we sort people out according to
their social status in the community and give them intelligence or achievement
tests, we find fairly large differences between the averages of the different
groups. In both world wars, comparisons of men with different occupations
were clear on this point. On the General Classification Test used in World
War II, for example, accountants and lawyers averaged about 128, clerical
workers about 118, electricians and other skilled craftsmen about 109, miners
and farmhands about 91. When the children of men from different occupations
are tested in school, they show the same differences as their fathers. Terman
and Merrill, for example, found, in the carefully selected sample of the
population on which Stanford-Binet norms are based, the following average IQ's
for two- to five-and-a-half-year-old children from families at different
occupational levels.

    Professional                                 115
    Semi-professional and managerial             112
    Clerical, skilled trades, retail business    108
    Semi-skilled, minor clerical and business    104
    Rural owners                                  98
    Slightly skilled
    Day labor, urban and rural

And they discovered similar differences for other age levels. Further, studies 
in which the social-class rating is based on community regard as well as on 
occupation alone yield the same intelligence differentiation between children 
from middle- and upper-class families and children from lower-class families. 


Finally, farm children typically average lower than city children.

The subsequent investigations of the meaning of these differences have
principally revolved around two hypothetical pivots. The first is that
intelligence tests, constructed as they usually are by middle-class psychologists and
educators, are unfair to lower-class children and thus constantly produce 
underestimates of their intellectual potentialities. According to this line of 
thinking, what we need most are tests made up of problems and questions 
more representative of the attitudes and thought processes of lower-class 
children. A large-scale research project at the University of Chicago directed 
by Kenneth Eells analyzed the responses children from different social classes 
made to test questions. Eells and Allison Davis then constructed a new test 
they called the Davis-Eells Games, which was made up exclusively of the types 
of problems on which the classes differed least. As Figure 23 shows, the test
is made up of pictures of real-life situations and practical, common-sense
questions about them. Unfortunately for the hypothesis on which it is based,
in tryouts of the test in a number of communities, lower-class children still
average lower than middle- and upper-class children, as on the more
conventional intelligence tests. Furthermore, scores on the Davis-Eells Games do
not predict success in school work as well as other intelligence scores do, so
its practical contribution to educational decisions is limited. This whole line
of research has made the hypothesis that social-class differences arise
primarily from the type of test used look less promising than it once did.

Figure 23. Sample items from the Davis-Eells Games Primary A, by Allison
Davis and Kenneth Eells. (New York: Harcourt, Brace & World, Copyright
1952. Reproduced by permission.) Which girl is starting the best way to get
the rich dirt to the flowers?

Arguments about whether social-class differences in intelligence and school
achievement arise from hereditary or environmental sources are less common
than they once were. Our awareness that the intellectual development of any
individual under any circumstances involves a complex combination of the
two factors has settled this controversy along with many others. What now
challenges educational researchers is the quest for ways of making the school
environment more stimulating to lower-class children than it typically is.
We need to know, too, why the children of some lower-class families move
up the social ladder to become much more successful than their parents 
were. And we need to know why some drop out of school at the earliest pos- 
sible time, apparently immune to formal education. Studies of creativity, of 
achievement motivation, and of many other complex attitudes and motives
are all relevant to these problems.

In this field too, we must always keep in mind the fact of overlapping
between the social classes. It is not just a Horatio Alger myth that some poor
boys become presidents—of corporations, of universities, or of the United 
States. That there is a wide range of talent within each social group is particu- 
larly important because of the vast difference between the classes in absolute 
size. That is, there are far more poor families than rich ones, far more skilled 
workers than professional men. In a careful study of a midwestern
community, Robert J. Havighurst and L. L. Janke (1944) identified the
following class structure:

Class A. Wealthy families (2 per cent of the population)
Class B. Professional men, officials, and leading businessmen (6 per cent of the
population)
Class C. Small businessmen, lesser professional workers, some skilled workers,
white-collar workers (37 per cent of the population)
Class D. Semi-skilled workers and laborers, hardworking and respectable people
(43 per cent of the population)
Class E. Lowest occupational groups with poor reputation in the community
(12 per cent of the population)


Even if it were true that half the children from Classes A and B and only
one-tenth of the children from Classes D and E were bright enough to qualify
for the best colleges and the most demanding positions, there would still be
a larger absolute number of D and E children in the select group than of A
and B children. For half of 8 per cent yields four persons out of every
hundred, and one-tenth of 55 per cent yields about five. As the need for trained
people in all parts of our society has become more urgent, the whole
distribution of intelligence ratings in each social class has come to seem more
important than the average rating for the group. We cannot run our complex
civilization without developing the talents of individuals from the groups
with the lower averages.
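
The arithmetic behind this comparison can be checked in a few lines. The sketch below is only an illustration: the class percentages come from the Havighurst and Janke table, the qualifying rates are the purely hypothetical ones in the "even if" argument, and the variable names are invented here. It also assumes that children are spread across the classes in the same proportions as the population. The text's figure of five is the computed 5.5 rounded down.

```python
# Class shares (per cent of the population) as listed in the text.
class_share = {"A": 2, "B": 6, "C": 37, "D": 43, "E": 12}

# Hypothetical qualifying rates from the "even if" argument (not measured data).
rate_upper = 0.5  # half of the children in Classes A and B
rate_lower = 0.1  # one-tenth of the children in Classes D and E

upper_share = class_share["A"] + class_share["B"]  # 8 per cent
lower_share = class_share["D"] + class_share["E"]  # 55 per cent

# Qualifying children per hundred children in the community.
upper_qualified = rate_upper * upper_share
lower_qualified = rate_lower * lower_share

print(f"Classes A and B: {upper_qualified:.1f} per hundred")
print(f"Classes D and E: {lower_qualified:.1f} per hundred")
```

The lower classes supply more qualifying children in absolute terms even though their assumed rate is one-fifth of the upper classes' rate, simply because they are about seven times as numerous.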
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It was assumed to start with that elderly adults would do less well than 
young adults on intelligence tests, as they do on physical examinations and 
strength tests. Most of the research done in the 20’s and the 30’s led to the 
same conclusions. W. R. Miles and C. C. Miles, for example, reported that 
although the average IQ for 20-year-olds with an eighth grade education was 
100, the average for 50-year-olds with the same amount of schooling was 
90, and the average for 80-year-olds only about 75. Similar differences were 
found in age groups with more education. Twenty-year-olds with college train- 
ing averaged about 120; 50-year-olds about 110. The facts that persons in 
the older groups had grown up in very different circumstances, had not be- 
come “test-wise” as children, and had perhaps not answered examination 
questions for years were ignored when the results were discussed. 

But as in the studies of sex, race, and social class, research on the mean- 
ing and origin of these age differences gradually accumulated. Some investi- 
gators analyzed the results with different kinds of test materials and different 
kinds of testing conditions, and found that on all performance tests older 
subjects ranked lower than on verbal tests; similarly, on speed tests they 
rated lower than on power tests. On the other hand, in several studies older 
testees actually averaged higher than younger ones on the vocabulary tests 
that constitute an important part of many general intelligence measures. 

In the 1950’s the results of several longitudinal projects were reported— 
studies in which the same adults were tested at different periods of their lives. 
In the most elaborate of these William A. Owens located 127 men who had 
taken Army Alpha at the time they entered college right after the end of 
World War I and retested them with the same test 30 years later, when they 
were in their late 40’s. Instead of scoring lower than they originally did, these 
men averaged higher on the test as a whole and on most of its subtests. 

What such studies show is that there need not be an inevitable decline in 
mental powers from youth at least through middle age. Educational influences 
operate here, as in the other group comparisons we have considered. The men 
in the study referred to above were for the most part college graduates who 
probably continued to educate themselves by reading, thinking, and conversing
after their college years were over. They lived in a period during which
there was a constant increase in the amount of educational stimulation available
in most parts of the United States: books and libraries, correspondence
and extension courses, radio and television programs. It seems probable now
that in earlier studies one reason persons over 50 scored so much lower than
younger adults was that they had grown up and spent most of their lives in
a much less rich educational environment.

Older and younger adults have been compared on tests of special abilities
and personality as well as intelligence. On the average, scores for groups of
persons who are at or beyond middle age usually turn out to be poorer than
those for younger groups if the test requires speed of response, perception of 
details, or the solution of novel or unfamiliar problems. Even in these special 
areas, however, adults of all ages seem to be able to maintain skills they 
acquired early in life. Thus, they have less of an occupational handicap than
they are often thought to have. In personality characteristics and interests,
the differences between adults of all ages are small, except for the very old,
who show more personality difficulties of various kinds: anxiety, social isolation,
neurotic and psychotic trends. These, however, may well be outgrowths
of the strains and difficult circumstances that come with retirement, bereavement,
and ill health rather than inevitable psychological handicaps concomitant
with aging.

Within any age group there is, once again, wide variation from person to 
person in all characteristics for which we have tests. We simply cannot judge 
how intelligent, how skillful, or how well adjusted a person is by considering 
age alone. All retirement plans run into this difficulty. Many individual 
65-year-olds are well above the mean for 25-year-olds even in speed and 
flexibility, in which youth as a rule clearly excels age. Again we must em- 
phasize the same conclusions stated for sex and race differences: Differences 
within a group are larger than differences between groups. The assessment of
an individual must be an individual matter.
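
How much two age groups can overlap despite a difference in averages may be sketched numerically. Every figure below is an assumption made for illustration and is not taken from any study cited here: scores in both groups are treated as normally distributed with a standard deviation of 15, the younger group's mean is set at 100, and the older group's mean at 90.

```python
from statistics import NormalDist

# Hypothetical score distributions (illustrative values only, not data).
younger = NormalDist(mu=100, sigma=15)
older = NormalDist(mu=90, sigma=15)

# Share of the older group scoring above the younger group's mean.
share_above = 1 - older.cdf(younger.mean)

print(f"Older persons above the younger mean: {share_above:.0%}")
```

Under these assumptions roughly one older person in four scores above the average younger person, which is why a group average says so little about any one individual.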


THE RELATIONSHIP BETWEEN MENTAL
AND PHYSICAL TRAITS


When measurements of psychological characteristics became available, it 
was natural that men should have used them in the search for answers to some 
questions that had intrigued thinking persons for years—questions about the 
relationship of physical and mental qualities. Diverse theories and various 
bits of “folk wisdom” needed to be tested. Is a sound mind most likely to
develop in a sound body? Does too rapid intellectual growth stunt bodily
development? Do persons of different build tend to differ also in temperament,
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mental traits are too slim to be of practical value in judging people. There is, 
it seems, a slight tendency for desirable traits to go together. High intelligence,
for example, is a little more likely to appear in the larger and stronger children 
than in the smaller and weaker ones. But the obtained correlations in studies 
on this matter are seldom higher than .20. And there are so many exceptions 
to the general trend that any combination of mental and physical abilities can 
occur. Some Phi Beta Kappas are healthy; others are sickly. Some athletes 
are brilliant; others are not. Some studious girls are beautiful; others are 
plain.
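
A short calculation shows why a correlation of .20 is of so little practical use. The sketch below relies on two standard statistical facts rather than on anything reported in the studies themselves: the square of a correlation gives the proportion of variance two measures share, and the square root of one minus that square gives the fraction of the original spread that remains in predictions made from a linear rule. The variable names are illustrative.

```python
import math

r = 0.20  # upper end of the correlations reported in the text

# Proportion of variance in one trait accounted for by the other.
shared_variance = r ** 2

# Spread of prediction errors relative to the spread with no predictor at all.
residual_sd_ratio = math.sqrt(1 - r ** 2)

print(f"Shared variance: {shared_variance:.0%}")
print(f"Remaining error spread: {residual_sd_ratio:.1%} of the original")
```

Knowing a child's physical standing would thus account for about 4 per cent of the variation in intelligence and would leave errors of prediction almost as large as plain guessing from the group average.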

Evidence points to a somewhat closer relationship between body build and 
temperament—that is, emotional and motivational characteristics rather than 
abilities. The classification of body builds most widely used at present is the 
one worked out by William H. Sheldon. In this system, endomorphy is the 
predominance of soft roundness in body structures, mesomorphy the predominance
of muscle and bone, and ectomorphy the predominance of thinness
and fragility. Sheldon's research seems to demonstrate that the more endomorphic
a person is, the more he is relaxed, friendly, and pleasure-loving in
his behavior and relationships with other people. Mesomorphy is related to 
vigor and adventurousness, ectomorphy to intellectual and introverted traits. 
When we consider all the available evidence along with Sheldon’s, however, 
we are forced to conclude that such relationships between physique and tem- 
perament are not very close. In some studies the correlations obtained seem 
mainly to reflect the inclination of persons taking part in the research, es- 
pecially those doing the rating, to expect certain relationships and so un- 
consciously to slant their data to produce them. On the whole, we can say, 
as we did after discussing group comparisons, that if we really wish to know 
how intelligent, adventurous, athletic, or friendly an individual is, the soundest 
way to find out is to evaluate that trait directly. We cannot judge a person
by appearance much more accurately than we can by sex, race, or age. 
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a child with what appears to be hereditary tendencies toward emotional insta- 
bility can be trained in habits that will keep the instability from handicap- 
ping him in the business of living. He can learn, for example, to clench his 
fists and count ten instead of throwing a temper tantrum, to stay out of social 
situations that irritate him, to choose an occupation where he can work at his 
own pace. How susceptible a trait is to modification through learning does
not usually depend on the extent to which it arises from either nature or
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grounds of the testees, investigators still hoped to be able to arrive at some 
figure that would express what the relative weights of hereditary and environ- 
mental determiners were. They thought for a time that it would be possible 
to make statements like this: “Intelligence is determined 80 per cent by 
heredity, 20 per cent by environment.” Most of the persons who have worked 
on this problem have now given up this aim. The trouble is that adding up 
components does not produce a person. It is the interaction that is important 
—the continuous transformation of a complex pattern in which hereditary 
and environmental elements are no longer distinguishable. But even though 
this research on the hereditary basis of intelligence did not achieve its purpose,
it produced a body of information and a set of research tools that are
useful in our study of a variety of questions relating to heredity.

The most significant findings have come out of studies of twins. Identical, 
or monozygotic, twins are ideal subjects for research on heredity, because they
have identical sets of genes. In such cases, the two individuals have grown
from a single fertilized egg cell that


Figure 24. Correlations between
pairs of children in the
same family for height and for
intelligence. (Twin correlations
from Woodworth, R. S. Heredity
and environment. New York
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In twin studies, two research 
Figure 25. Average IQ differences
for pairs of children with
different degrees of relationship.
(Woodworth, R. S. Heredity 
and environment. New York: 
Soc. Sci. Res. Council, 1941.) 
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statements is true in the case of intelligence and a number of other mental 
characteristics is that research points to a reality somewhere between the two 
extremes. As Figure 25 shows, the average difference between twins is greater 
for those reared apart than for those reared together, but it is nowhere nearly 
as great as the difference between unrelated children. 

It is these studies of identical twins reared in different homes and com- 
munities that have told us most about what aspect of the environment has 
the greatest effect on intelligence. In a thorough case study of 19 pairs of 
separated identical twins, H. H. Newman rated the environments separately 
for the educational and social advantages they provided. In all instances where 
one twin turned out to be much brighter than the other, educational influences
differed markedly. But social advantages did not seem to have much effect. 

Other questions about the relative hereditary or environmental determina- 
tion of various psychological characteristics have been approached in several 
other ways—studies of family histories, research on children in foster homes, 
animal-breeding studies. All these lines of investigation point to the general 
conclusion that many or most abilities and personality traits have some heredi- 
tary basis, but that their development is also affected by environmental in- 
fluences. What we need now is research leading to more specific conclusions: 
What sort of effect do certain kinds of environment have on particular heredi- 
tary tendencies? We have passed the stage of argument over heredity versus 
environment, but we still have a great deal to survey and map in the area such
arguments used to cover.
Philosophical psychology (see Psychology, philosophical)
Physical traits, 105-106 
Power tests (see Tests, power) 
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